Abstract 

Many tasks like image segmentation, web page classification, and information extraction can be cast 
as joint inference tasks in collective graphical models. Such models exploit any inter-instance associative 
dependence to output more accurate labelings. However, existing collective models support only a very limited kind 
of associativity, such as associative labeling of different occurrences of the same word in a text corpus. This 
restricts the accuracy gains from using such models. 

In this work we make two major contributions. First, we propose a more general collective inference 
framework that encourages various data instances to agree on a set of properties of their labelings. Agreement 
is encouraged through symmetric clique potential functions. We show that known collective models are specific 
instantiations of our framework with certain very simple properties. We demonstrate that using non-trivial 
properties can lead to bigger gains, and present a systematic inference procedure in our framework for a large 
class of such properties. In our inference procedure, we perform message passing on the cluster graph, where 
property-aware messages are computed with cluster specific algorithms. Ordinary property-oblivious message 
passing schemes are intractable in such setups. We show that property conformance, as encouraged in our 
framework, provides an inference-only solution for domain adaptation. Our experiments on bibliographic 
information extraction illustrate significant test error reduction over unseen domains. 

Our second major contribution is a suite of algorithms to compute messages from clique clusters to other 
clusters for a variety of symmetric clique potentials (the clique inference problem). Our algorithms are exact 
for arbitrary cardinality-based clique potentials on binary labels and for max-like and majority-like clique 
potentials on multiple labels. For majority-like potentials, we also provide an efficient Lagrangian Relaxation 
based algorithm that compares favorably with the exact algorithm. Moving towards more complex potentials, 
we show that clique inference becomes NP-hard for cliques with homogeneous Potts potentials. We present 
a 13/15-approximation algorithm with runtime sub-quadratic in the clique size. In contrast, the best known 
previous guarantee for graphs with Potts potentials is only 1/2. We perform empirical comparisons on real and 
synthetic data, and show that our proposed methods for Potts potentials are an order of magnitude faster 
than the well-known Tree-based re-parameterization (TRW) and graph-cut algorithms. We demonstrate that 
our Lagrangian Relaxation based algorithm for majority potentials beats the best applicable heuristic, ICM, 
in a variety of scenarios. 



1 Introduction 



A variety of structured tasks such as image segmentation, information extraction, part of speech tagging, text 
chunking, and named entity recognition are modeled using Markov Random Fields (MRFs). For example, in 
information extraction, each sentence is treated as a MRF that captures the dependency in the labels assigned 
to adjacent words in the sentence. 

An example of such a setup is given in Figure 1 for the task of named-entity recognition (NER). The base 
model in Figure 1(b) assigns a named-entity label such as Person, Location, or Other independently to each word 
in the input. The structured model goes one step further and imposes a dependency between labels of adjacent 
words, shown in Figure 1(c) via chain-shaped MRF models. This model, however, ignores long-range and 
inter-sentence dependencies. The collective model of Figure 1(d) encourages the labels of different occurrences of the 
same word to be the same. This is captured by connecting those occurrences with blue cliques that encode 
associative dependencies. Variants of these collective models have been proposed in the past few years for a 
variety of information extraction tasks. We look at other applications of collective graphical 
models in Section 2. 



[Figure 1 panels: (a) Input sentences ("War in Iraq continues..", "US troops in Iraq suffered..", "..coalition troops enter Iraq."); (b) Base Model; (c) Structured Model; (d) Collective Model; (e) Cluster graph for the collective model] 

Figure 1: Various models for named-entity recognition illustrated on a small corpus. In Figure 1(e) the boxes 
denote separators that link the clusters. The 'size' of a separator is the number of nodes inside it. 

Table 1: A brief summary of various graphical models for information extraction. 



The key ingredient in a collective model is the set of potentials used to tie the individual MRFs together. These 
potentials, which can be defined over cliques of arbitrary size, encourage their vertices to have the same or similar 
labels. We consider a special kind of clique potential: symmetric potentials. Symmetric clique potentials 
are invariant under any permutation of their arguments. This restriction is meant to keep inference tractable 
with such potentials. Collective inference uses symmetric potentials that encourage all the vertices to take the 
same label. This can be enforced by choosing an appropriate bias function in the potential. Various kinds of 
symmetric potentials have been used in collective inference thus far, e.g. Potts potentials [22, 5] and Majority 
potentials [15]. We look at various families of symmetric clique potentials in Section 3. 



Properties-based Collective Inference Framework 

Figure 1 illustrates a very common collective model in the literature. Such a model enforces a very special kind of 
associativity: labels of repeated occurrences of a word should be the same. Restricting ourselves to such 
collective models does not allow us to exploit the full power of collective inference, especially when unexploited 
associativity of a more complex nature exists in the data, e.g. all occurrences of Person should be preceded by 
titular tokens such as Mr. or Mrs. We broaden the notion of collective inference to encourage richer forms of 
associativity amongst the labelings of multiple MRFs. This more general framework has applications in domain 
adaptation. We illustrate this via an example. 

Example: Consider extracting bibliographic information from an author's publications homepage using a model 
trained on a different set of authors. Typically, within each homepage (a domain) we expect consistency in the style 
of individual publication records. For example, we expect that labelings of individual bibliographic records 
(approximately) use the same ordering of labels (say Title → Author* → Venue), regardless of what that 
ordering is. Another property whose conformance might be desirable is "the HTML tag containing the Title 
of the publication". Different bibliographic records on the same page will most probably format the title using 
the same HTML tag. Thus, we can encode this associativity by biasing the labelings to be conformant with respect 
to this property. For both these properties, we only demand that the labelings agree on the property value, without 
caring what the value actually is (which varies from domain to domain). This allows us to use the same 
property on different domains, with varying formatting and authoring styles. 

Now assume that we have an array of such conformance-promoting properties, and a sequential chain model 
trained on a set of labeled domains. We show that an effective way of adapting the trained model to a new domain 
is by labeling the MRFs in the new domain collectively while encouraging the individual labelings to agree on 
our set of properties. This is an inference-only approach, unlike many existing solutions for domain adaptation, 
which require expensive model re-training. As we will see in Section 7.2, using this properties-based 
framework provides significant gains on a bibliographic information extraction task. 

To summarize, our collective inference framework consists of two new components attached to any collection 
of MRFs: 

• Properties: defined over the labelings of individual MRFs, 

• Potentials: defined on the values of properties of all MRFs such that the potential value favors skewness 
in the frequencies of property values. 

We describe our framework in detail in Section 4. 

The collective inference task in our framework is to choose a labeling of the individual MRFs so as to maximize 
the sum of the scores of each MRF and the potentials of each property. This inference task is more complicated 
than independently labeling each MRF, each of which is typically a simple tractable model such as a sequence. 

We address the computational challenge by defining special forms of decomposable properties and symmetric 
potentials that allow efficient inference without sacrificing usability. We exploit this special structure to design 
efficient MAP inference algorithms. Instead of ordinary belief propagation on the joint graphical model, we 
define a cluster graph with two kinds of clusters: those corresponding to tractable MRFs, and cliques with symmetric 
potential functions. Two important aspects of this algorithm are as follows. First, we use combinatorial methods 
to compute cluster-specific messages. In Section 6 we present exact and approximate clique inference algorithms 
for a variety of symmetric potential functions. These algorithms are used to compute max-marginal messages 
from a clique to its neighboring clusters. Second, we exploit the form of our properties to define new intermediate 
message variables, and provide exact and approximate algorithms for computing these special message values in 
Section 5. In contrast, a naive application of graphical model inference could lead to an entire MRF instance 
being a separator. 

In Section 7 our experiments on real tasks show that this form of message passing is faster and more accurate 
than existing inference methods that do not exploit the form of the potentials. 

Finally, in Section 8 we discuss some future directions for collective inference and outline some important 
problems in the area. 

Contributions 

Our first key contribution is a framework that encourages associativity between properties of labelings of isolated 
MRFs. We show that the framework supports a large class of decomposable properties. Our properties are 
functions of a data instance and its labeling, in contrast to existing associative setups, which model only very 
specific properties of the instances alone. We give an approximate inference procedure based on message passing 
on the cluster graph for computing the MAP labeling in our framework. Our procedure maintains tractability 
by computing property-aware messages and invoking special combinatorial algorithms at the cliques. 

The second key contribution is a family of algorithms for various kinds of symmetric clique potential functions. 
We give an $O(mn \log n)$ MAP algorithm for cliques with arbitrary symmetric potentials, where $m$ is the number 
of labels, and $n$ is the clique size. This algorithm is exact for max-like potentials, and is 13/15-approximate for Potts 
potentials. We show that this algorithm can be generalized to an $O(m^2 n \log n)$ algorithm while improving the 
approximation bound to 8/9. For majority-like potentials, we present an LP-based exact algorithm with polynomial 
but expensive runtime. We present an alternative approximate algorithm based on Lagrangian relaxation that 
is two orders of magnitude faster and provides close to optimal solutions in practice. 

Finally, we show that our suite of algorithms can be plugged into the properties-based framework to achieve 
a highly expressive way for capturing associativity. We illustrate this on a bibliographic information task where 
we use properties to deploy our collective framework over unseen bibliographic domains, and achieve significant 
error reductions. 



Outline 

In Section 2 we give some real-life scenarios where collective inference can be used to exploit associativity amongst 
isolated MRF instances. We model associativity using symmetric potentials. In Section 3 we describe three 
families of symmetric potentials that subsume the Potts and linear MAJORITY potentials. Section 4 discusses our 
properties-based collective inference framework in formal detail. Our framework ensures tractability of inference 
as long as the properties are decomposable, a notion that we cover in Section 4.1. In Section 5 we discuss the 
cluster message passing algorithm to compute the MAP in our framework. Section 5.1 presents our approach 
for exactly computing property-aware messages from an MRF instance to a clique, and Section 5.1.2 contains 
practical approximations of this exact computation. In Section 5.2 we show that computing the reverse message, 
from a clique to an MRF instance, is the same as the clique inference problem. Then in Section 6 we present 
algorithms for solving the clique inference problem under a variety of symmetric clique potentials. The two 
key algorithms presented are the α-pass algorithm and a Lagrangian-relaxation based algorithm for majority 
potentials, in Sections 6.1 and 6.3.3 respectively. Section 7 contains experimental results of three types: (a) 
establishing that our properties-based framework leads to significant gains in a domain adaptation task, (b) showing 
that our clique inference algorithms are better than applicable alternatives, and (c) showing that the cluster message 
passing framework is a better way of doing inference. Finally, Section 8 contains conclusions and a discussion of future work. 



2 Applications of Collective Inference 

We review a few practical applications of collective inference in real-life tasks. Both tasks benefit when we 
introduce associative dependencies between labelings of isolated instances. 



2.1 Information Extraction 



Recall Figure 1(d) for the task of named-entity extraction. Let the potential function for an edge between 
adjacent word positions $j-1$ and $j$ in document $i$ be $\phi_{ij}(y, y')$, and for non-adjacent positions that share a word 
$w$ be $f_w(y, y')$. The goal during inference is to find a labeling $\mathbf{y}$, where $y_{ij}$ is the label of word $x_{ij}$ in position $j$ 
of document $i$, so as to maximize: 

$$\sum_{i,j} \phi_{ij}(y_{ij}, y_{i(j-1)}) + \sum_{w} \sum_{x_{ij} = x_{i'j'} = w} f_w(y_{ij}, y_{i'j'}) \qquad (1)$$

The above inference problem quickly becomes intractable with the addition of non-adjacent edges beyond the 
highly tractable collection of chain models of classical IE. Consequently, all prior work on collective extraction 
for IE relied on generic approximation techniques including belief propagation [22, 5], Gibbs sampling [5] or 
stacking [15]. 

We present a different view of the above inference problem using cardinality-based clique potential functions 
$C_w()$ defined over label subsets of positions where word $w$ occurs. We rewrite the second term in Equation 1 
as 

$$\frac{1}{2} \sum_{w} \Big( \sum_{y,y'} n_y(\mathbf{y}^w)\, n_{y'}(\mathbf{y}^w)\, f_w(y, y') - \sum_{y} n_y(\mathbf{y}^w)\, f_w(y, y) \Big) = \sum_{w} C_w\big(n_1(\mathbf{y}^w), \ldots, n_m(\mathbf{y}^w)\big)$$

where $\mathbf{y}^w$ denotes the labels of the occurrences of $w$ and $n_y(\mathbf{y}^w)$ is the number of times $w$ is labeled $y$ in all its 
occurrences. The clique potential $C_w$ only depends on the counts of how many nodes get assigned a particular label. 
A useful special case of the function is when $f_w(y, y')$ is positive only for the case that $y = y'$, and zero otherwise. 
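
The rewriting above can be checked numerically. The following is a minimal sketch, assuming that the pairwise term sums a symmetric edge potential over all unordered pairs of distinct occurrences of a word; the potential `f_w` and the label list are hypothetical toy values, not taken from the paper. It compares the pairwise sum with the same quantity computed from label counts alone.

```python
from collections import Counter
from itertools import combinations

def pairwise_score(labels, f):
    """Sum f over all unordered pairs of distinct occurrences of the word."""
    return sum(f(a, b) for a, b in combinations(labels, 2))

def cardinality_score(labels, f):
    """Same quantity from label counts only (assumes f is symmetric)."""
    n = Counter(labels)
    all_ordered = sum(n[y] * n[z] * f(y, z) for y in n for z in n)
    self_pairs = sum(n[y] * f(y, y) for y in n)
    return (all_ordered - self_pairs) / 2

# Hypothetical symmetric edge potential: reward agreement,
# penalize Person/Location clashes, ignore everything else.
def f_w(y, z):
    if y == z:
        return 2.0
    return -1.0 if {y, z} == {"Person", "Location"} else 0.0

labels = ["Location", "Location", "Person", "Other", "Location"]
assert pairwise_score(labels, f_w) == cardinality_score(labels, f_w)
```

The count-based form never looks at which occurrence received which label, which is what makes the clique potential symmetric.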

2.2 Hypertext classification 

In hypertext classification, the goal is to classify a document based on features derived from its content and 
the labels of documents it points to. A common technique in statistical relational learning to capture the dependency 
between a node and the variable number of neighbors it might be related to is to define fixed-length feature 
vectors out of the neighbors' labels. In text classification, most previous approaches [7] have created 
features based on the counts of labels in its neighborhood. Accordingly, we can define the following set of 
potentials: a node-level potential $\phi_i(y)$ that depends on the content of the document $i$, and a neighborhood 
potential $f(y, n_1(\mathbf{y}^{O_i}), \ldots, n_m(\mathbf{y}^{O_i}))$ that captures the dependency of the label of $i$ on the counts in the label 
vector $\mathbf{y}^{O_i}$ of its out-links. 

Writing the neighborhood potential as a per-label clique potential $C_y(\cdot) = f(y, \cdot)$, the objective takes the form: 

$$\sum_i \phi_i(y_i) + \sum_i \sum_y C_y\big(n_1(\mathbf{y}^{O_i}), \ldots, n_m(\mathbf{y}^{O_i})\big)\, [\![ y = y_i ]\!]$$

[16] include several examples of such clique potentials, viz. the Majority potential $C_y(n_1, \ldots, n_m) = \phi(y, y_{\max})$ 
where $y_{\max} = \operatorname{argmax}_{y'} n_{y'}$, and the Count potential $C_y(n_1, \ldots, n_m) = \sum_{y' : n_{y'} > 0} \phi(y', y)\, n_{y'}$. Some of these 
potentials, for example the Majority potential, are not decomposable as a sum of potentials over the edges of the 
clique. This implies that methods such as TRW and graph-cuts are not applicable. [16] rely on the Iterated 
Conditional Modes (ICM) method that greedily selects the best label of each document in turn based on the 
label counts of its neighbors. 

3 Symmetric Clique Potentials 

As seen in the example scenarios, our associative clique potentials depend only on the number of clique vertices 
taking a value $v$, denoted by $n_v$, and not on the identity of those vertices. In other words, these potentials are 
invariant under any permutation of their arguments and derive their value from the histogram of counts $\{n_v \mid \forall v\}$. 
We denote this histogram by the vector $\mathbf{n}$. Since the potentials only depend on the value counts, we also refer 
to them as cardinality-based clique potentials in this paper. If a cardinality-based clique potential is associative, 
then it is maximized when $n_v = n$ for some $v$, i.e. one value is given to all the clique vertices. 

We have deliberately left the notion of a 'value' vague at this point. For existing collective models, e.g. those 
mentioned in Section 2, a value corresponds to a label. As we shall see, in our more general framework, a value 
refers to a particular member in the range of a property function. For now, we can assume without loss of generality 
that a value is a member of some discrete finite set $V$. 

We consider specific families of clique potentials, many of which are currently used in real-life tasks. In 
Section 6 we will look at various potential-specific exact and approximate clique inference algorithms that exploit 
the specific structure of the potential. 

In particular, we consider the three types of clique potentials listed in Table 2. 

3.1 MAX clique potentials 

These clique potentials are of the form: 

$$C(n_1, \ldots, n_{|V|}) = \max_v f_v(n_v) \qquad (2)$$

for arbitrary non-decreasing functions $f_v$. When $f_v(n_v) = n_v$, we get the makespan clique potential, which has 
roots in the job-scheduling literature. 

In Section 6.1 we present an algorithm, called α-pass, that solves the clique inference problem for MAX 
potentials exactly. The algorithm runs in time $O(|V| n \log n)$, where $n$ is the clique size. Although MAX potentials 
are not used directly in real-life tasks, they are relatively easier potentials to tackle and provide key insights to 
deal with the more complex SUM potentials. As we will see, the α-pass algorithm that we derive for this potential 
can be easily ported to other more complex potentials. For the case of Potts potentials, we will prove that the 
α-pass algorithm provides a 13/15-approximation. 

Name       Form                                                          Remarks
MAX        $\max_v f_v(n_v)$                                             $f_v$ is a non-decreasing function
SUM        $\sum_v f_v(n_v)$                                             $f_v$ non-decreasing. Includes the Potts potential $= \lambda \sum_v n_v^2$
MAJORITY   $f_a(\mathbf{n})$, where $a = \operatorname{argmax}_v n_v$    $f_a$ is typically linear

Table 2: Various kinds of symmetric clique potentials considered in this paper. $\mathbf{n} = (n_1, \ldots, n_{|V|})$ denotes the 
counts of various values among the clique vertices. 



3.2 SUM clique potentials 

SUM clique potentials are of the form: 

$$C(n_1, \ldots, n_{|V|}) = \sum_v f_v(n_v) \qquad (3)$$

This form of potentials includes the special case when the well-known Potts model is applied homogeneously 
on all edges of a clique. Let $\lambda$ be the Potts potential of assigning two nodes of an edge the same value. The 
summation of these potentials over a clique is equivalent (up to a constant) to the clique potential: 

$$C^{\mathrm{Potts}} = C(n_1, \ldots, n_m) = \lambda \sum_v n_v^2 \qquad (4)$$

The Potts model with negative $\lambda$ corresponds to the dis-associative case when edges prefer the two end points 
to take different values. The more interesting case is when $\lambda$ is positive. For this case, we will borrow the 
α-pass algorithm for MAX potentials and show that it gives a 13/15-approximation. 
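
The relation between homogeneous Potts edge potentials and the count-based clique form can be verified on a toy clique. This is a minimal sketch with hypothetical values: summing λ over all same-valued edges gives half of λ times the sum of squared counts, minus the constant λn/2, so the two forms rank labelings identically once the factor of one half is absorbed into λ.

```python
from collections import Counter
from itertools import combinations

def potts_edge_sum(values, lam):
    """Sum of the Potts potential lam over all clique edges whose endpoints agree."""
    return sum(lam for a, b in combinations(values, 2) if a == b)

def potts_clique(values, lam):
    """Clique form: lam times the sum of squared value counts."""
    return lam * sum(c * c for c in Counter(values).values())

values = ["a", "a", "b", "a", "c", "b"]   # toy assignment of values to clique vertices
lam, n = 1.5, len(values)

# edge sum = (1/2) * clique form - lam * n / 2, where lam * n / 2 is a constant
assert potts_edge_sum(values, lam) == 0.5 * potts_clique(values, lam) - lam * n / 2
```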

3.3 MAJORITY potentials 

MAJORITY potentials have been used for a variety of tasks such as link-based classification of web-pages [16] 
and named-entity extraction [15]. A MAJORITY potential over a clique $C$ is parameterized by a $|V| \times |V|$ matrix 
$W = \{w_{vv'}\}$. The role of $W$ is to capture the co-existence of different value pairs in the same clique. 

Co-existence allows us to downplay 'strict associativity', viz. giving all vertices of a clique the same value. The 
justification for co-existence is as follows. Consider the conventional collective inference model for named-entity 
recognition (Figure 1(d)) where a value corresponds to a label. Suppose the word 'America' occurs in a corpus 
multiple times. Then all occurrences of 'America' will be joined with an associative clique. However, some 
occurrences of America correspond to Location, while others might correspond to an Organization, say Bank of 
America. Thus we require most but not all vertices in the America clique to be labeled similarly. This motivates 
the need for a clique potential with scope for co-existence. 

Coming back to $W$, a highly positive $w_{vv'}$ would suggest that the values $v$ and $v'$ should be allowed to co-exist 
in a clique, when $v$ is the majority value in the clique. We allow $w_{vv'}$ to be negative to model mutual exclusion 
amongst value pairs. Our algorithms work for unrestricted $W$, although in practice the training procedure that 
learns $W$ might add some constraints. 

We now define MAJORITY potentials as: 

$$C(n_1, \ldots, n_{|V|}) = f_a(\mathbf{n}), \quad a = \operatorname{argmax}_v n_v \qquad (5)$$

We consider linear MAJORITY potentials where $f_a(\mathbf{n}) = \sum_v w_{av} n_v$. The matrix $W = \{w_{vv'}\}$ need not be diagonally 
dominant or even symmetric. Unlike the Potts potential, the MAJORITY potential cannot be represented using edge 
potentials. 
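
The three families of this section can be written as functions of the count histogram. The sketch below is illustrative only: the function names, the toy matrix `W`, and the arbitrary tie-breaking of the majority value are assumptions, and values with zero count are skipped, which presumes they contribute nothing to the potential.

```python
from collections import Counter

def max_potential(counts, f):
    """MAX family: largest f(v, n_v) over the values present in the clique."""
    return max(f(v, n) for v, n in counts.items())

def sum_potential(counts, f):
    """SUM family: sum of f(v, n_v) over the values present in the clique."""
    return sum(f(v, n) for v, n in counts.items())

def majority_potential(counts, W):
    """Linear MAJORITY: sum_v W[a][v] * n_v, with a the most frequent value."""
    a = max(counts, key=counts.get)   # ties broken arbitrarily (assumption)
    return sum(W[a][v] * n for v, n in counts.items())

values = ["Loc", "Loc", "Org", "Loc"]
counts = Counter(values)

makespan = max_potential(counts, lambda v, n: n)           # f_v(n_v) = n_v
potts = sum_potential(counts, lambda v, n: 1.0 * n * n)    # Potts form with lambda = 1
W = {"Loc": {"Loc": 1.0, "Org": 0.5}, "Org": {"Loc": 0.5, "Org": 1.0}}
majority = majority_potential(counts, W)

# Symmetry: permuting the clique vertices leaves every potential unchanged.
shuffled = Counter(reversed(values))
assert max_potential(shuffled, lambda v, n: n) == makespan
assert majority_potential(shuffled, W) == majority
```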






4 Generalized Collective Inference Framework 



We now discuss our framework for generalized collective inference. Recall that we wish to encourage the labelings 
of various isolated MRFs to agree on a set of properties. Our generalized collective inference framework consists 
of three parts: 

1. A collection of structured instance-labeling pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ where each is probabilistically modeled 
using a corresponding Markov Random Field (MRF). Let $\phi(\mathbf{x}_i, \mathbf{y}_i)$ be a scoring function for assigning 
labeling $\mathbf{y}_i$ to $\mathbf{x}_i$ using the MRF. The scoring function decomposes over the parts $c$ of the MRF as 
$\phi(\mathbf{x}_i, \mathbf{y}) = \sum_c \phi_c(\mathbf{x}_i, \mathbf{y}_c)$. 

2. A set $P$ of properties where each property $p \in P$ includes in its domain a subset $\mathcal{D}_p$ of MRFs and maps 
each labeling $\mathbf{y}$ of an input $\mathbf{x} \in \mathcal{D}_p$ to a discrete value from its range $\mathcal{R}'_p$. Each property decomposes over 
cliques of the MRF. We discuss decomposable properties in Section 4.1. 

3. A clique potential $C_p(\{p(\mathbf{x}_i, \mathbf{y}_i)\}_{\mathbf{x}_i \in \mathcal{D}_p})$ for each property $p$. This potential is a symmetric function of its 
input. We elaborated on various symmetric potential functions in Section 3. These potentials encourage 
conformity of properties across labelings of multiple MRFs. 

The collective inference task is to label the $N$ instances so as to maximize the sum of the individual 
MRF-specific scores and the clique potentials coupling many MRFs via the property functions. This is given by: 

$$\max_{(\mathbf{y}_1, \ldots, \mathbf{y}_N)} \sum_{i=1}^{N} \phi(\mathbf{x}_i, \mathbf{y}_i) + \sum_{p \in P} C_p\big(\{p(\mathbf{x}_i, \mathbf{y}_i)\}_{i \in \mathcal{D}_p}\big) \qquad (6)$$

Even for symmetric $C_p$ and binary labels, Equation 6 is NP-hard to optimize. One well-known hard case is the 
Ising model, where each $C_p$ is a Potts potential. Thus we look at an approximate approach based on message 
passing that we elaborate in Section 5. 
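
To make the objective concrete, the sketch below evaluates it by brute force on a toy problem. Everything here is an illustrative assumption rather than the paper's method: each instance comes with a short list of candidate labelings and made-up MRF scores, a single property applies to all instances, and the clique potential is the Potts form. Exhaustive enumeration is exponential in the number of instances and is shown only to clarify what is being maximized.

```python
from collections import Counter
from itertools import product

def potts(values, lam=1.0):
    """Potts clique potential: lam times the sum of squared value counts."""
    return lam * sum(c * c for c in Counter(values).values())

def collective_map(candidates, properties):
    """candidates[i]: list of (labeling, mrf_score) pairs for instance i.
    properties: list of (property_fn, clique_potential) pairs.
    Returns the joint labeling maximizing MRF scores plus clique potentials."""
    best, best_score = None, float("-inf")
    for joint in product(*candidates):
        score = sum(s for _, s in joint)
        for prop, potential in properties:
            score += potential([prop(y) for y, _ in joint])
        if score > best_score:
            best, best_score = [y for y, _ in joint], score
    return best, best_score

def next_label(y):
    """Toy property: the label that immediately follows Title."""
    return y[y.index("Title") + 1]

A, V = ("Title", "Author"), ("Title", "Venue")
candidates = [
    [(A, 2.0), (V, 1.75)],
    [(A, 1.0), (V, 1.5)],   # on its own, this record prefers Venue
    [(A, 3.0), (V, 0.5)],
]
best, score = collective_map(candidates, [(next_label, potts)])
```

With these numbers the second record switches from Venue to Author, because agreeing with the other two records on the property value outweighs its small local preference.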

4.1 Decomposable Properties 

A property maps a $(\mathbf{x}, \mathbf{y})$ pair to a discrete value in its range. Typically, since the number of labelings $\mathbf{y}$ is 
exponential in the size of $\mathbf{x}$, we cannot solve Equation 6 tractably without constructing the value of a property from 
smaller components of $\mathbf{y}$. We define decomposable properties as those which can be broken over the parts $c$ of the MRF 
of labeling $\mathbf{y}$, just like $\phi$. Such properties can be folded into the message computation steps at each of the MRFs, as 
we shall see in Section 5. We now formally describe decomposable properties: 

Definition 4.1. A decomposable property $p(\mathbf{x}, \mathbf{y})$ is composed out of component-level properties $p(\mathbf{x}, \mathbf{y}_c, c)$ defined 
over parts $c$ of $\mathbf{y}$. $p : (\mathbf{x}, \mathbf{y}_c, c) \mapsto \mathcal{R}_p \cup \{\perp\}$ where the special symbol $\perp$ means that the property is not applicable 
to $(\mathbf{x}, \mathbf{y}_c, c)$. $p(\mathbf{x}, \mathbf{y})$ is composed as: 

$$p(\mathbf{x}, \mathbf{y}) \triangleq \begin{cases} \emptyset & \text{if } \forall c : p(\mathbf{x}, \mathbf{y}_c, c) = \perp \\ v & \text{if } \forall c : p(\mathbf{x}, \mathbf{y}_c, c) \in \{v, \perp\} \\ \perp & \text{otherwise.} \end{cases} \qquad (7)$$

The first case occurs when the property does not fire over any of the parts. The last case occurs when $\mathbf{y}$ has 
more than one part where the property has a valid value but the values are different. The new range $\mathcal{R}'_p$ now 
consists of $\mathcal{R}_p$ and the two special symbols $\perp$ and $\emptyset$. 
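
The composition rule of Definition 4.1 can be sketched as a small function over the component-level values of one labeling. The three marker strings are hypothetical stand-ins for "not applicable at this part", "fired on no part", and "parts disagree"; they are assumed not to collide with real property values.

```python
NA = "N/A"            # component level: property not applicable to this part
EMPTY = "EMPTY"       # composite: the property fired on no part
CONFLICT = "CONFLICT" # composite: two parts fired with different values

def compose(part_values):
    """Combine component-level property values into one value for the labeling."""
    fired = {v for v in part_values if v != NA}
    if not fired:
        return EMPTY
    if len(fired) == 1:
        return next(iter(fired))
    return CONFLICT

assert compose([NA, NA, NA]) == EMPTY
assert compose([NA, "Author", NA, "Author"]) == "Author"
assert compose(["Author", NA, "Venue"]) == CONFLICT
```

Because the result depends only on the set of fired values, it can be accumulated part by part, which is what allows the property to be tracked during message passing inside an MRF.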

We show that even with decomposable properties we can express many useful types of regularities in labeling 
multiple MRFs arising in applications like domain adaptation. 

Example 1 We start with an example from the simple collective inference task of favoring the same 
label for repeated words. Let $\mathbf{x}$ be a sentence and $\mathbf{y}$ be a labeling of all the tokens of $\mathbf{x}$. Consider a property 
$p$, called TokenLabel, which returns the label of a fixed token $t$. Then, $\mathcal{D}_p$ comprises all $\mathbf{x}$ which have the 
token $t$, and $\mathcal{R}_p$ is the set of all labels. Thus, if $\mathbf{x} \in \mathcal{D}_p$, then 

$$p(\mathbf{x}, \mathbf{y}_c, c) = \begin{cases} y_c & x_c = t \\ \perp & \text{otherwise} \end{cases} \qquad (8)$$

and, given $\mathbf{y}$ and $\mathbf{x} \in \mathcal{D}_p$, 

$$p(\mathbf{x}, \mathbf{y}) = \begin{cases} y & \text{all occurrences of } t \text{ in } \mathbf{x} \text{ are labeled with label } y \\ \perp & \text{otherwise} \end{cases} \qquad (9)$$



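Example 2 Next consider a more complex example that allows us to express regularity in the order of labels 
in a collection of bibliography records. Let $\mathbf{x}$ be a publications record and $\mathbf{y}$ its labeling. Define property $p$, 
called NextLabel, which returns the first non-Other label in $\mathbf{y}$ after a Title. A special label 'End' marks the 
end of $\mathbf{y}$. So $\mathcal{R}_p$ contains 'End' and all labels except Other. Thus, 

$$p(\mathbf{x}, \mathbf{y}_c, c) = \begin{cases} \beta & y_c = \text{Title} \wedge y_{c+i} = \beta \wedge (\forall j : 0 < j < i : y_{c+j} = \text{Other}) \\ \text{End} & y_c = \text{Title} \wedge c \text{ is the last clique in } \mathbf{y} \\ \perp & y_c \neq \text{Title} \end{cases} \qquad (10)$$

Therefore, 

$$p(\mathbf{x}, \mathbf{y}) = \begin{cases} \emptyset & \mathbf{y} \text{ has no Title} \\ \beta & \beta \text{ is the first non-Other label following each Title in } \mathbf{y} \\ \perp & \text{otherwise} \end{cases} \qquad (11)$$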



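Example 3 In both the above examples the range $\mathcal{R}'_p$ of the properties consisted of labels. Consider a third property, 
called BeforeToken, whose range is the space of tokens. This property returns the identity of the token before 
a Title in $\mathbf{y}$. So, 

$$p(\mathbf{x}, \mathbf{y}_c, c) = \begin{cases} x_{c-1} & y_c = \text{Title} \wedge (c > 0) \\ \text{'Start'} & y_c = \text{Title} \wedge (c = 0) \\ \perp & y_c \neq \text{Title} \end{cases} \qquad (12)$$

Therefore, 

$$p(\mathbf{x}, \mathbf{y}) = \begin{cases} \emptyset & \text{No Title in } \mathbf{y} \\ \text{'Start'} & \text{The only Title in } \mathbf{y} \text{ is at the beginning of } \mathbf{y} \\ t & \text{All Titles in } \mathbf{y} \text{ are preceded by token } t \\ \perp & \mathbf{y} \text{ has two or more Titles with different preceding tokens} \end{cases} \qquad (13)$$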
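
Example 3 can be sketched in code as follows. This is an illustrative reading, not the authors' implementation: each token position is treated as one part, titles are assumed to be single tokens, and the marker strings and toy records are hypothetical.

```python
NA, EMPTY, CONFLICT = "N/A", "EMPTY", "CONFLICT"

def before_token_part(x, y, c):
    """Component-level BeforeToken at position c: the token preceding a Title."""
    if y[c] != "Title":
        return NA
    return x[c - 1] if c > 0 else "Start"

def before_token(x, y):
    """Compose the component values over all positions of the labeling."""
    fired = {v for v in (before_token_part(x, y, c) for c in range(len(y))) if v != NA}
    if not fired:
        return EMPTY
    return fired.pop() if len(fired) == 1 else CONFLICT

x = ["<li>", "Parsing", "Smith", "ACL"]
agree = before_token(x, ["Other", "Title", "Author", "Venue"])
no_title = before_token(x, ["Other", "Other", "Author", "Venue"])
clash = before_token(["Parsing", "Smith", "Tagging"], ["Title", "Author", "Title"])
```

Here `agree` is the token `<li>`, `no_title` is the "fired nowhere" marker, and `clash` is the conflict marker because the two Titles are preceded by 'Start' and by "Smith" respectively.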



Some important families of symmetric clique potentials have been described in [10] and recalled in Section 3. 
We use the two most widely used of those families, called Potts and MAJORITY, defined as: 

$$C^{\mathrm{Potts}}(\{v_1, \ldots, v_n\}) \triangleq \lambda \sum_{v'} n_{v'}^2$$

$$C^{\mathrm{Maj}}(\{v_1, \ldots, v_n\}) \triangleq \sum_{v'} w_{vv'}\, n_{v'}, \quad v = \operatorname{argmax}_{v'} n_{v'}$$

where $n_v$ is the frequency of $v$ in the multiset $\{v_1, \ldots, v_n\}$ and $\lambda$, $w$ are fixed parameters of the potentials. 

Figure 2: Cluster graph for a toy example with three chain-shaped MRF instances and two properties. The 
TokenLabel property has thin separators, while NextLabel has separators that consist of the entire instance. 
Both the properties have associative potentials defined on their cliques (shown as blue circles). 



5 MAP Estimation in the Generalized Collective Inference Framework 

The natural choice for approximating the NP-hard objective in Equation 6 is ordinary pairwise belief propagation 
on the joint model. This approach does not work for several reasons. First, some symmetric potentials like 
the MAJORITY potential cannot be decomposed along the edges. Second, property-aware messages cannot be 
computed for arbitrary message passing schedules. Third, cluster-membership information of vertices, which is 
vital, is not exploited at all. 

Other approaches like the stacking-based approach of [15] are specific to particular symmetric potentials and 
do not exploit the full set of messages to compute a more accurate MAP. 

Hence we adopt message passing on the cluster graph of the model as our approach, akin to the one proposed 
by [8]. We create a top-level cluster graph model where the clusters correspond to the $N$ instances and $|P|$ 
property cliques. The cluster node of each instance is internally another nested MRF. The cluster for a property 
$p$ is a clique whose vertices correspond to instances in $\mathcal{D}_p$. Figure 2 illustrates an example with two properties 
and three data instances. 

For complex properties like the NextLabel property of Section 4.1, the separator between an MRF cluster 
and a property cluster is the entire instance. This is a major departure from known collective models such as the 
one in Figure 1(e). Known collective models use very simple properties, e.g. TokenLabel in Figure 1(e), which 
lead to single-vertex separators because the property clique is incident on instances only through a single token, 
which is known in advance. In the case of complex properties like NextLabel, not only is the property clique 
incidence information missing, but the clique's incidence is dependent on the property of the entire labeling and 
not just a single token's label. This causes the entire instance to be a separator between the property cluster 
and the instance MRF cluster. Therefore, naive message passing schemes whose runtime is exponential in the 
separator size are inapplicable here. However, we exploit the decomposability of properties to simplify message 
updates. 

The setup of message passing on the cluster graph allows us to exploit potential-specific algorithms at the 
cliques, and at the same time work with any arbitrary clique potential. It also allows intuitive computation of 
property-aware messages. 

Let $m_{i \to p}$ and $m_{p \to i}$ denote message vectors from instance $i$ to an incident property clique $p$ and vice-versa. 
Let $v \in \mathcal{R}'_p$ denote a property value. Next we discuss how these messages are computed. 

5.1 Message from an Instance to a Clique 

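The message $m_{i \to p}(v)$ is given by: 

$$m_{i \to p}(v) = \max_{\mathbf{y} : p(\mathbf{x}_i, \mathbf{y}) = v} \Big( \phi(\mathbf{x}_i, \mathbf{y}) + \sum_{p' \neq p : i \in \mathcal{D}_{p'}} m_{p' \to i}\big(p'(\mathbf{x}_i, \mathbf{y})\big) \Big) \qquad (14)$$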



To compute $m_{i \to p}(v)$, we need to absorb the incoming messages from other incident properties $p' \neq p$, and do 
the maximization. When a property $p'$ is applicable to only a single fixed clique $c$ of the instance, we can easily 
absorb the message $m_{p' \to i}$ by including it in the potential of the clique, $\phi_c$. This is true, for instance, for the 
TokenLabel property in the previous section when within a sentence the word does not repeat. In the general 
case, absorbing messages that are applicable over multiple cliques requires us to ensure that the cliques agree on 
the property values following Equation 7. We will refer to these as multi-clique properties. 

Figure 3: Computation of the message $m_{i \to p}$. Instance $i$ is incident to $|K|$ properties $p_1, \ldots, p_{|K|}$ where $p = p_1$. 
The green portion shows the internal messages $I(\cdot)$ in instance $i$. The final message is computed in terms of 
the aggregated message $M(u_1, \ldots, u_{|K|})$ and any incoming messages $m_{p_j \to i}$, $j > 1$ (the red portion). 

We first present an exact extension of the message passing algorithm within an instance MRF to enforce
such agreement and later present approximations. Let $K$ be the set of multi-clique properties in whose domain
instance $i$ lies. We are targeting applications where $|K|$ is small.

5.1.1 Exact Messages 

Figure 3 shows the various messages involved in computing the message $m_{i\to p}$. We describe the procedure
step by step. Consider an internal message $I_{c\to c'}(\mathbf{y}_s)$ between adjacent cliques $c$ and $c'$ inside the MRF of
instance $i$, with $s$ as the separator. This is computed in standard message passing as follows:

$$I_{c\to c'}(\mathbf{y}_s) = \max_{\mathbf{y}_c \sim \mathbf{y}_s} \phi_c(\mathbf{y}_c) + \sum_{j=1}^{l} I_{c_j\to c}(\mathbf{y}_{s_j}) \qquad (15)$$

where $c_1, \ldots, c_l$ denote the $l$ neighboring cliques of clique $c$ excluding $c'$, and $s_1, \ldots, s_l$ are the corresponding
separators. To handle multi-clique properties, we augment these internal messages to maintain state about the set
of properties already encountered in any partial labeling up to $c$.

For ease of explanation, first consider the case where $|K| = 1$, and let $p$ be the only property relevant for
instance $i$. We maintain messages of the kind $I_{c\to c'}(\mathbf{y}_s, u)$, where $u \in \mathcal{R}'_p$ is called the value argument of
$I$. These messages compute the following quantity: what is the score of the best partial labeling $\mathbf{y}_{part}$ up to $c$,
which is consistent with $\mathbf{y}_s$, such that $p(\mathbf{x}, \mathbf{y}_{part}) = u$ (as per Definition 4.1), if we ignore the cliques beyond $c'$?
To compute this message, we consider only the following entities:

1. All local labelings $\mathbf{y}_c$, consistent with $\mathbf{y}_s$, such that $p(\mathbf{x}, \mathbf{y}_c, c)$ does not conflict with $u$.

2. Incoming messages at $c$ (except from $c'$) of the kind $I_{c_j\to c}(\mathbf{y}_{s_j}, v_j)$ such that $v_j$ does not conflict with $u$,
and all the $v_j$'s together with $p(\mathbf{x}, \mathbf{y}_c, c)$ can produce a property value $u$ at $c$.

Another way to look at it is as follows. Consider the green portion of Figure 3 and let $v_1, \ldots, v_l$ be $l$ candidate
property values produced by some partial labelings running up to $c_1, \ldots, c_l$ respectively. Also consider a local
labeling $\mathbf{y}_c$ at $c$, consistent with $\mathbf{y}_s$. Then computing the message $I_{c\to c'}(\mathbf{y}_s, u)$ is the same as composing $u$
by picking a combination of $\mathbf{y}_c$ and $v_1, \ldots, v_l$ such that $p(\mathbf{x}, \mathbf{y}_c, c), v_1, \ldots, v_l$ can be amalgamated into $u$ via
Definition 4.1. If no such combination exists, then the message is $-\infty$; else we return the combination with the
highest total score.

Thus, depending on $u$, Equation 15 is modified as follows:

Case 1: $u = \emptyset$: To get a value of $\emptyset$, the property should fire neither at $c$, nor up to or at any of the $c_j$'s. That
is, we need $p(\mathbf{x}, \mathbf{y}_c, c) = \emptyset$ and incoming messages at $c$ whose value arguments are also $\emptyset$. Thus $I_{c\to c'}(\mathbf{y}_s, \emptyset)$ is
computed as:

$$I_{c\to c'}(\mathbf{y}_s, \emptyset) = \max_{\mathbf{y}_c \sim \mathbf{y}_s:\ p(\mathbf{x},\mathbf{y}_c,c)=\emptyset} \phi_c(\mathbf{y}_c) + \sum_{j=1}^{l} I_{c_j\to c}(\mathbf{y}_{s_j}, \emptyset)$$

Case 2: $u \in \mathcal{R}_p$: In this case, $p(\mathbf{x}, \mathbf{y}_c, c)$ and the value arguments of its incoming messages should each be one of $u$
and $\emptyset$, with at least one of them being $u$ (otherwise all of them will be $\emptyset$ and we will get $\emptyset$ at $c$). This will hold
if the predicate $V_1(\mathbf{y}_c, v_1, \ldots, v_l)$, defined below, is true.

$$V_1(\mathbf{y}_c, v_1, \ldots, v_l) = (\forall j: v_j \in \{u, \emptyset\}) \wedge (p(\mathbf{x},\mathbf{y}_c,c) \in \{u, \emptyset\}) \wedge (p(\mathbf{x},\mathbf{y}_c,c) = u \vee \exists j: v_j = u) \qquad (16)$$

Here $v_1, \ldots, v_l$ denote value arguments of the incoming messages at $c$, excluding the one from $c'$. If the set
$\{v_1, \ldots, v_l, p(\mathbf{x}, \mathbf{y}_c, c)\}$ contains two distinct values from $\mathcal{R}_p$, then they will conflict and create $\bot$ at $c$ on
composition. Thus, $V_1$ precisely and completely represents the set of valid combinations for producing $u$. Using $V_1$ we
can compute $I_{c\to c'}(\mathbf{y}_s, u)$ as:

$$I_{c\to c'}(\mathbf{y}_s, u) = \max_{\mathbf{y}_c \sim \mathbf{y}_s,\ v_1,\ldots,v_l:\ V_1(\mathbf{y}_c,v_1,\ldots,v_l)} \phi_c(\mathbf{y}_c) + \sum_{j=1}^{l} I_{c_j\to c}(\mathbf{y}_{s_j}, v_j) \qquad (17)$$

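Case 3: $u = \bot$: We can produce $\bot$ when either (a) value arguments of two or more incoming messages at $c$
conflict, or (b) the property value at $c$ conflicts with the value argument of one of the incoming messages, or (c)
either the property value at $c$ or any one of the value arguments is $\bot$. The predicate $V_2(\mathbf{y}_c, v_1, \ldots, v_l)$ returns
true if any one of the above outcomes holds:

$$\begin{aligned}
V_2(\mathbf{y}_c, v_1, \ldots, v_l) ={}& (\exists j: v_j = \bot) \vee (p(\mathbf{x},\mathbf{y}_c,c) = \bot) \vee (\exists j,k: v_j \neq v_k \wedge v_j, v_k \in \mathcal{R}_p) \\
& \vee \big( (p(\mathbf{x},\mathbf{y}_c,c) = v_0,\ v_0 \in \mathcal{R}_p) \wedge (\exists j: v_j \neq v_0,\ v_j \in \mathcal{R}_p) \big)
\end{aligned} \qquad (18)$$

The outgoing message $I_{c\to c'}(\mathbf{y}_s, \bot)$ is then:

$$I_{c\to c'}(\mathbf{y}_s, \bot) = \max_{\mathbf{y}_c \sim \mathbf{y}_s,\ v_1,\ldots,v_l:\ V_2(\mathbf{y}_c,v_1,\ldots,v_l)} \phi_c(\mathbf{y}_c) + \sum_{j=1}^{l} I_{c_j\to c}(\mathbf{y}_{s_j}, v_j) \qquad (19)$$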
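The three cases can be read as a single composition rule over the local property value and the incoming value arguments. The following Python sketch is only an illustration of that reading, not the authors' implementation: the names `EMPTY`, `BOTTOM`, `compose`, `v1`, `v2` and `cases_agree` are hypothetical, and `compose` encodes our assumption that no firing yields the empty value, a single fired value yields that value, and anything else yields a conflict.

```python
from itertools import product

EMPTY = "empty"    # hypothetical marker: the property did not fire
BOTTOM = "bottom"  # hypothetical marker: conflicting firings


def compose(values):
    """Amalgamate component-level property values into a single value."""
    if BOTTOM in values:
        return BOTTOM
    fired = {v for v in values if v != EMPTY}
    if not fired:
        return EMPTY
    if len(fired) == 1:
        return next(iter(fired))
    return BOTTOM


def v1(u, local, incoming):
    """Equation 16: every value is u or EMPTY, and u occurs at least once."""
    everything = [local, *incoming]
    return all(v in (u, EMPTY) for v in everything) and u in everything


def v2(local, incoming, prop_range):
    """Equation 18: some value is BOTTOM, or two distinct range values occur."""
    everything = [local, *incoming]
    fired = {v for v in everything if v in prop_range}
    return BOTTOM in everything or len(fired) >= 2


def cases_agree(prop_range, num_incoming):
    """Check that each predicate selects exactly the combinations compose() maps to its case."""
    symbols = list(prop_range) + [EMPTY, BOTTOM]
    for local in symbols:
        for incoming in product(symbols, repeat=num_incoming):
            result = compose([local, *incoming])
            if (result == BOTTOM) != v2(local, incoming, prop_range):
                return False
            if any((result == u) != v1(u, local, incoming) for u in prop_range):
                return False
    return True
```

Under these assumptions, `cases_agree(["a", "b"], 2)` exhaustively confirms that the three cases partition all combinations of one local value and two incoming value arguments.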

After completing the internal message passing schedule, we can compute the final aggregated message $M(u)$
by sending a message from the last clique (wlog, say $c_l$) to a dummy root clique $r$:

$$M(u) \triangleq I_{c_l\to r}(\cdot, u) \qquad (20)$$

where the separator labeling is irrelevant because the message is to a dummy clique $r$.

If $|K| = 1$, then the message $m_{i\to p}(v)$ is simply $M(v)$. This treatment generalizes nicely to the case when
$|K| > 1$. We extend the internal message vector $I(\cdot)$ for each combination of values of the $|K|$ properties. Call it
$I_{c\to c'}(\mathbf{y}_s, u_1, \ldots, u_{|K|})$ where $u_j \in \mathcal{R}'_{p_j}$. Let the final aggregated message at a dummy root clique inside instance
$i$ be $M(u_1, \ldots, u_{|K|})$. The outgoing message $m_{i\to p}(v)$ to property $p$ can now be computed as:

$$m_{i\to p}(v) = \max_{u_1,\ldots,u_{|K|}:\ u_p = v} M(u_1, \ldots, u_{|K|}) + \sum_{p' \neq p} m_{p'\to i}(u_{p'}) \qquad (21)$$

The overhead for $|K| > 1$ is that we have to absorb the incoming messages from the other $|K| - 1$ properties,
shown in the red portion of Figure 3.

5.1.2 Approximations 

Various approximations are possible to reduce the number of value combinations for which messages have to be
maintained.

Typically, the within-instance dependencies are stronger than the dependencies across instances. We can
exploit this to reduce the number of property values as follows: First, find the MAP of each instance independently.
Only those property values that are fired in the MAP labeling of at least one of the instances are considered in
the range. In the application we study in Section 7.2, this reduced the size of the range of properties drastically.

A second trick can be used when properties are associated with labels that tend not to repeat in the MRF,
e.g. Title for the citation extraction task. In that case, the value $\bot$ can be ignored, and we can relax the
consistency checks on properties $p'$ that are being absorbed so that they can be absorbed locally at each clique
as follows. First, normalize incoming messages from $p'$ as $m_{p'\to i}(v) - m_{p'\to i}(\emptyset)$. Next, absorb the normalized
message in the clique potential $\phi_c(\mathbf{y}_c)$ of all cliques where $p'(\mathbf{x}, \mathbf{y}_c, c) = v$. Finally, compute the outgoing message
by only keeping state over the values of the outgoing property. When the MAP does not contain any repeat
firings of a property, this method returns the exact answer.

A third option is to rely on generic search optimization tricks like beam search and rank aggregation. In
beam search, instead of maintaining messages for each possible combination of property values, we maintain only
the top-k most promising combinations in each message passing step.
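As a rough illustration of the beam search idea, the sketch below prunes one message to its best-scoring value combinations. It is not taken from the paper: the function name `prune_to_beam` and the dictionary representation of a message are hypothetical, and ranking combinations by their current message score is an assumption, since the text does not say how "most promising" is measured.

```python
import heapq


def prune_to_beam(message, beam_width):
    """Keep only the beam_width highest-scoring value combinations of a message.

    message maps a tuple of property values (u_1, ..., u_|K|) to its score.
    """
    kept = heapq.nlargest(beam_width, message.items(), key=lambda item: item[1])
    return dict(kept)
```

For example, pruning a message with three value combinations to a beam of width 2 drops the lowest-scoring combination and keeps the other two unchanged.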

5.2 Message from a Clique to an Instance 

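The message $m_{p\to i}(v)$ is computed as:

$$m_{p\to i}(v) = \max_{(v_1,\ldots,v_n):\ v_i = v} \sum_{j \neq i,\ \mathbf{x}_j \in \mathcal{D}_p} m_{j\to p}(v_j) + C_p(\{v_j\}_{\mathbf{x}_j \in \mathcal{D}_p}) \qquad (22)$$

Message $m_{p\to i}(v)$ requires maximizing the objective in Equation 22, which can be rewritten as

$$-m_{i\to p}(v) + \max_{(v_1,\ldots,v_n):\ v_i = v} \sum_{j:\ \mathbf{x}_j \in \mathcal{D}_p} m_{j\to p}(v_j) + C_p(\{v_j\}_{\mathbf{x}_j \in \mathcal{D}_p})$$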

The maximization subtask can be cast in terms of the general clique inference problem defined as:

Definition 5.1. Given a clique over $n$ vertices, with a symmetric clique potential $C(v_1, \ldots, v_n)$, and vertex
potentials $\psi_{jv_j}$ for all $j \le n$ and values $v_j$, compute the value assignment with the highest potential:

$$\max_{v_1,\ldots,v_n} \sum_{j=1}^{n} \psi_{jv_j} + C(v_1, \ldots, v_n) \qquad (23)$$

In our case, $\psi_{jv} = m_{j\to p}(v)$ and $C = C_p$. To compute $m_{p\to i}(v)$, we can solve the clique inference problem
with the restriction $v_i = v$.
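To make the clique inference problem concrete, here is a minimal brute-force sketch in Python. It is meant only as a reference point for tiny instances and is not one of the paper's algorithms; the names `potts` and `clique_inference_bruteforce`, the list-of-lists layout `psi[j][v]`, and the optional `fixed` argument (used to impose a restriction such as $v_i = v$) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def potts(assignment, lam=1.0):
    """Potts clique potential: lam times the sum of squared value counts."""
    return lam * sum(c * c for c in Counter(assignment).values())


def clique_inference_bruteforce(psi, clique_potential, fixed=None):
    """Exhaustively maximize sum_j psi[j][v_j] + C(v_1, ..., v_n).

    psi[j][v] is the vertex potential of vertex j for value v.
    fixed is an optional dict {vertex: value} of imposed assignments.
    """
    n, num_values = len(psi), len(psi[0])
    fixed = fixed or {}
    best_score, best = float("-inf"), None
    for assignment in product(range(num_values), repeat=n):
        if any(assignment[j] != v for j, v in fixed.items()):
            continue
        score = sum(psi[j][assignment[j]] for j in range(n)) + clique_potential(assignment)
        if score > best_score:
            best_score, best = score, assignment
    return best_score, best
```

The enumeration takes time exponential in $n$, which is why the specialized algorithms of the next section are needed in practice.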

We are interested in the cases when the clique potential is Potts or majority, which were defined earlier.
These are the most popular potentials for real-life collective inference tasks.

In [10], a $\frac{13}{15}$-approximate clique inference algorithm, called $\alpha$-pass, was presented for $C^{Potts}$, along with
an expensive polynomial-time exact algorithm for $C^{Maj}$. $\alpha$-pass can also be applied to arbitrary symmetric
potentials and is exact for binary valued properties. The time complexity of $\alpha$-pass is $O(|\mathcal{R}'_p| n \log n)$, as compared
to $O(|\mathcal{R}'_p|^2 n^2)$ for ordinary belief propagation.

We next show that although $\alpha$-pass is also applicable for majority potentials, it lacks desirable theoretical
guarantees. We then present a new approximate inference algorithm for $C^{Maj}$ based on Lagrangian Relaxation
which is much faster than the exact algorithm yet produces almost-optimal scores in practice.

6 Algorithms for Clique Inference 

In this section we explore various exact and approximate schemes for maximizing the clique inference objective
in Definition 5.1 under a variety of symmetric potential functions. Of particular interest are the Potts and
MAJORITY potentials, but some of the algorithms are more general and apply to families of potentials.

These clique algorithms are called as subroutines while calculating the messages from property cliques to
instances, in accordance with Equation 22. Throughout this section, we assume that the clique corresponds to
a fixed property $p$ with range $\mathcal{R}'_p$. $R$ will be short-hand for $|\mathcal{R}'_p|$.

We will use $F(v_1, \ldots, v_n)$ to denote the clique inference objective. As short-hand, we will denote $F(v_1, \ldots, v_n)$
by $F(\mathbf{v}) = \psi(\mathbf{v}) + C(\mathbf{v})$, where $\psi(\mathbf{v})$ is the vertex score of $\mathbf{v}$ and the second term is the clique score. Wlog assume
that the vertex potentials are positive. Otherwise a constant can be added to all of them and that will not affect
the maximization. The best value assignment will be denoted by $\mathbf{v}^*$, and $\hat{\mathbf{v}}$ will denote an approximate solution.

6.1 $\alpha$-pass Algorithm

We begin with MAX potentials. These potentials are not used in practice, but clique inference for MAX potentials
gives rise to the $\alpha$-pass algorithm, which has very interesting properties. Recall that a MAX potential is of the
form $C(\{v_1, \ldots, v_n\}) = \max_v f_v(n_v)$. The $\alpha$-pass algorithm is described in Algorithm 1.

Input: Vertex potentials $\psi$, clique potential $C$, set $\mathcal{R}'_p$ of allowed values
Output: Value assignment $v_1, \ldots, v_n$
Best = $-\infty$;
$\hat{\mathbf{v}}$ = nil;
foreach value $\alpha \in \mathcal{R}'_p$ do
    Sort the vertices in decreasing order of the metric $\psi_{j\alpha} - \max_{v \in \mathcal{R}'_p, v \neq \alpha} \psi_{jv}$;
    foreach $k \in \{1, \ldots, n\}$ do
        Assign the first $k$ sorted vertices the value $\alpha$;
        Assign the remaining vertices their individual best non-$\alpha$ value;
        $s \leftarrow$ score of this assignment;
        if $s >$ Best then
            Best $\leftarrow s$;
            $\hat{\mathbf{v}} \leftarrow$ current assignment;
        end
    end
end
return $\hat{\mathbf{v}}$;

Algorithm 1: The $\alpha$-pass algorithm
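Algorithm 1 can be sketched in Python as follows. This is an illustrative, non-incremental version: it rescores every candidate assignment from scratch, so it does not achieve the $O(Rn \log n)$ running time. The function names, the list-of-lists layout `psi[j][v]`, and the particular MAX potential $f_v(n_v) = n_v^2$ are assumptions made for this sketch, and `brute_force` is included only so the result can be compared on tiny instances.

```python
from collections import Counter
from itertools import product


def potts(assignment, lam=1):
    """Potts potential: lam times the sum of squared value counts."""
    return lam * sum(c * c for c in Counter(assignment).values())


def max_potential(assignment):
    """A MAX potential max_v f_v(n_v); f_v(n_v) = n_v^2 is an arbitrary choice here."""
    return max(c * c for c in Counter(assignment).values())


def score(psi, assignment, clique_potential):
    return sum(psi[j][v] for j, v in enumerate(assignment)) + clique_potential(assignment)


def alpha_pass(psi, clique_potential):
    """psi[j][v] is the potential of vertex j for value v; needs at least two values."""
    n, num_values = len(psi), len(psi[0])
    best_score, best = float("-inf"), None
    for alpha in range(num_values):
        # Best non-alpha value of every vertex.
        other = [max((v for v in range(num_values) if v != alpha), key=lambda v: psi[j][v])
                 for j in range(n)]
        # Vertices that gain the most from taking alpha come first.
        order = sorted(range(n), key=lambda j: psi[j][alpha] - psi[j][other[j]], reverse=True)
        for k in range(1, n + 1):
            assignment = list(other)
            for j in order[:k]:
                assignment[j] = alpha
            s = score(psi, assignment, clique_potential)
            if s > best_score:
                best_score, best = s, assignment
    return best_score, best


def brute_force(psi, clique_potential):
    """Exhaustive maximization, usable only for tiny instances."""
    num_values = len(psi[0])
    return max(score(psi, a, clique_potential)
               for a in product(range(num_values), repeat=len(psi)))
```

On small instances with a MAX potential, the score returned by `alpha_pass` can be compared against `brute_force`; for other potentials such as Potts, the sketch returns a candidate assignment that is not guaranteed to be optimal.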

For each $(\alpha, k)$ combination, the $\alpha$-pass algorithm computes the best $k$ vertices to get the value $\alpha$. Let $\hat{\mathbf{v}}^{\alpha k}$
denote the complete assignment in the $(\alpha, k)$th step. Then it is easy to see that $\alpha$-pass runs in $O(|\mathcal{R}'_p| n \log n)$
time by incrementally computing $F(\hat{\mathbf{v}}^{\alpha k})$ from $F(\hat{\mathbf{v}}^{\alpha(k-1)})$. We now look at properties of $\alpha$-pass.

Claim 6.1. Assignment $\hat{\mathbf{v}}^{\alpha k}$ has the maximum vertex score over all $\mathbf{v}$ where $k$ vertices are assigned $\alpha$, that is,
$\psi(\hat{\mathbf{v}}^{\alpha k}) = \max_{\mathbf{v}:\ n_\alpha(\mathbf{v}) = k} \psi(\mathbf{v})$.

Proof. This is easily seen by contradiction. If some other assignment $\mathbf{v} \neq \hat{\mathbf{v}}^{\alpha k}$ has the best vertex score, then
it differs from $\hat{\mathbf{v}}^{\alpha k}$ in the assignment of at least two vertices, one of which is assigned $\alpha$ in $\mathbf{v}$ and non-$\alpha$ in $\hat{\mathbf{v}}^{\alpha k}$.
The converse holds for the other differing vertex. By swapping their assignments, it is possible to increase the
vertex score of $\mathbf{v}$, a contradiction.

Claim 6.2. For MAX potentials, $C(\hat{\mathbf{v}}^{\alpha k}) \ge f_\alpha(k)$.

Proof. This is because the value $\alpha$ has a count of $k$ and the MAX potential considers the maximum over all
counts.

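Theorem 6.1. The $\alpha$-pass algorithm finds the MAP for MAX clique potentials.

Proof. Let $\mathbf{v}^*$ be the optimal assignment and let $\beta = \arg\max_v f_v(n_v(\mathbf{v}^*))$, $t = n_\beta(\mathbf{v}^*)$. Let $\hat{\mathbf{v}}$ be the assignment
found by $\alpha$-pass. We have:

$$\begin{aligned}
F(\hat{\mathbf{v}}) &= \max_{1 \le \alpha \le |\mathcal{R}'_p|,\ 1 \le k \le n} F(\hat{\mathbf{v}}^{\alpha k}) \\
&\ge F(\hat{\mathbf{v}}^{\beta t}) \\
&= \psi(\hat{\mathbf{v}}^{\beta t}) + C(\hat{\mathbf{v}}^{\beta t}) \\
&\ge \psi(\hat{\mathbf{v}}^{\beta t}) + C(\mathbf{v}^*) \\
&\ge \psi(\mathbf{v}^*) + C(\mathbf{v}^*) \\
&= F(\mathbf{v}^*)
\end{aligned}$$

The second and third inequalities follow from Claims 6.2 and 6.1 respectively. □

Thus, $\alpha$-pass finds the optimal assignment for the MAX family of potentials in $O(Rn \log n)$ time. We now
move on to SUM potentials.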

6.2 Clique Inference for SUM Potentials 

We will mainly focus on the Potts potential, which is arguably the most popular member of the SUM family.
The Potts potential is given by $C(\mathbf{v}) = \lambda \sum_v n_v^2$.

When $\lambda < 0$, the clique edges prefer the two end points to take different values. With negative $\lambda$, our objective
function $F(\mathbf{v})$ becomes concave and its maximum can be easily found using a relaxed quadratic program followed
by an optimal rounding step as suggested in [21]. We therefore do not discuss this case further. The more
interesting case is when $\lambda$ is positive. We show that finding $\mathbf{v}^*$ now becomes NP-hard.

Theorem 6.2. When $C(\mathbf{v}) = \lambda \sum_v n_v^2$, $\lambda > 0$, finding the MAP assignment is NP-hard.

Proof. Let $R = |\mathcal{R}'_p|$. We prove hardness by reducing from the NP-complete exact cover by 3-sets problem [19]
of deciding if exactly $\frac{n}{3}$ of $R$ subsets $S_1, \ldots, S_R$ of 3 elements each from $U = \{e_1, \ldots, e_n\}$ can cover $U$. We let
elements correspond to vertices and sets to values. Assign $\psi_{iv} = 2n\lambda$ if $e_i \in S_v$ and $0$ otherwise. The MAP score will
be $(2n^2 + 3^2 \frac{n}{3})\lambda$ iff we can find an exact cover. □

The above proof establishes that, unless P = NP, there cannot be an algorithm that is polynomial in both $n$ and $R$. But we
have not ruled out algorithms with complexity that is polynomial in $n$ but exponential in $R$, say of the form
$O(2^R n^c)$ for a constant $c$.

We next propose approximation schemes. Unlike for general graphs, where the Potts model is approximable
only within a factor of $\frac{1}{2}$ [12], we show that for cliques the Potts model can be approximated to within a
factor of $\frac{13}{15} \approx 0.87$ using the $\alpha$-pass algorithm. We first present an easy proof for a weaker bound of $\frac{4}{5}$ and then
move on to a more detailed proof for the $\frac{13}{15}$ bound. Recall that the optimal assignment is $\mathbf{v}^*$ and the assignment
output by $\alpha$-pass is $\hat{\mathbf{v}}$.

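Theorem 6.3. $F(\hat{\mathbf{v}}) \ge \frac{4}{5} F(\mathbf{v}^*)$.

Proof. Without loss of generality assume that the counts in $\mathbf{v}^*$ are $n_1 \ge n_2 \ge \ldots \ge n_R$, where $R = |\mathcal{R}'_p|$. Then

$$\begin{aligned}
F(\hat{\mathbf{v}}) &\ge F(\hat{\mathbf{v}}^{1 n_1}) = \psi(\hat{\mathbf{v}}^{1 n_1}) + C(\hat{\mathbf{v}}^{1 n_1}) \\
&\ge \psi(\mathbf{v}^*) + C(\hat{\mathbf{v}}^{1 n_1}) \quad \text{(from Claim 6.1)} \\
&\ge \psi(\mathbf{v}^*) + \lambda n_1^2 \quad \text{(since } \lambda > 0\text{)} \\
&\ge \psi(\mathbf{v}^*) + C(\mathbf{v}^*) - \lambda n_1 n + \lambda n_1^2 \\
&\ge F(\mathbf{v}^*) - \lambda n^2/4
\end{aligned}$$

Now consider the two cases where $F(\mathbf{v}^*) \ge \frac{5}{4} \lambda n^2$ and $F(\mathbf{v}^*) < \frac{5}{4} \lambda n^2$. For the first case we get from above that
$F(\hat{\mathbf{v}}) \ge F(\mathbf{v}^*) - \lambda n^2/4 \ge \frac{4}{5} F(\mathbf{v}^*)$. For the second case, we know that the score $F(\hat{\mathbf{v}}^{Rn})$, where we assign all
vertices the last value, is at least $\lambda n^2$ and thus $F(\hat{\mathbf{v}}) \ge \frac{4}{5} F(\mathbf{v}^*)$. □

We now state the more involved proof for showing that $\alpha$-pass actually provides a tighter approximation
bound of $\frac{13}{15}$ for Potts potentials.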

Theorem 6.4. $F(\hat{\mathbf{v}}) \ge \frac{13}{15} F(\mathbf{v}^*)$.

Proof. The proof is by contradiction. Suppose there is an instance where $F(\hat{\mathbf{v}}) < \frac{13}{15} F(\mathbf{v}^*)$. Wlog assume that
$\lambda = 1$, and let $n_1 \ge n_2 \ge \ldots \ge n_k > 0$ ($2 \le k \le R$) be the non-zero counts in the optimal solution and let $\psi^*$ be
its vertex potential. Thus $F(\mathbf{v}^*) = \psi^* + n_1^2 + n_2^2 + \ldots + n_k^2$.

Now, $F(\hat{\mathbf{v}})$ is at least $\psi^* + n_1^2$ (ref. Claim 6.1). This implies $\frac{\psi^* + n_1^2}{\psi^* + n_1^2 + \ldots + n_k^2} < \frac{13}{15}$, i.e. $2(\psi^* + n_1^2) < 13(n_2^2 + \ldots + n_k^2)$, or

$$\psi^* < \frac{13}{2}(n_2^2 + \ldots + n_k^2) - n_1^2 \qquad (24)$$

Since $k$ values have non-zero counts, and the vertex score is $\psi^*$, at least $\psi^*/k$ of the vertex score is assigned
to one value. Considering a solution where all vertices are assigned to this value, we get $F(\hat{\mathbf{v}}) \ge \psi^*/k + n^2$.
Therefore $F(\mathbf{v}^*) > \frac{15}{13}(n^2 + \psi^*/k)$.
Since $F(\mathbf{v}^*) = \psi^* + n_1^2 + \ldots + n_k^2$, we get:

$$\psi^* > \frac{15kn^2 - 13k(n_1^2 + \ldots + n_k^2)}{13k - 15} \qquad (25)$$

We show that Equations 24 and 25 contradict each other. It is sufficient to show that for all $n_1 \ge \ldots \ge n_k \ge 1$,

$$\frac{15kn^2 - 13k(n_1^2 + \ldots + n_k^2)}{13k - 15} \ge \frac{13}{2}(n_2^2 + \ldots + n_k^2) - n_1^2$$

Simplifying, this is equivalent to

$$kn^2 - \frac{13}{2}(k - 1)(n_2^2 + \ldots + n_k^2) - n_1^2 \ge 0. \qquad (26)$$

Consider a sequence $n_1, \ldots, n_k$ for which the expression on the left hand side is minimized. If $n_i > n_{i+1}$
then we must have $n_l = 1\ \forall l \ge i + 2$. Otherwise, replace $n_{i+1}$ by $n_{i+1} + 1$ and decrement $n_j$ by 1, where $j$ is
the largest index for which $n_j > 1$. This gives a new sequence for which the value of the expression is smaller.
Therefore the sequence must be of the form $n_i = n_1$ for $1 \le i < l$ and $n_i = 1$ for $i > l$, for some $l \ge 2$. Further,
considering the expression as a function of $n_l$, it is quadratic with a negative second derivative. So the minimum
occurs at one of the extreme values $n_l = 1$ or $n_l = n_1$. Therefore we only need to consider sequences of the form
$n_1, \ldots, n_1, 1, \ldots, 1$ and show that the expression is non-negative for these.

In such sequences, differentiating with respect to $n_1$, the derivative is positive for $n_1 \ge 1$, which means that
the expression is minimized for the sequence $1, \ldots, 1$. Now it is easy to verify that it is true for such sequences.
The expression is zero only for the sequence $1, 1, 1$, which gives the worst case example. □



The next theorem shows that the analysis in Theorem 6.4 is tight. We present a pathological example where $\alpha$-pass
gives a solution which is exactly $\frac{13}{15}$ of the optimal.

Theorem 6.5. The approximation ratio of $\frac{13}{15}$ of the $\alpha$-pass algorithm is tight.

Proof. We show an instance where this is obtained. Let $R = n + 3$ and $\lambda = 1$. For the first $n/3$ vertices let
$\psi_{u1} = 4n/3$, for the next $n/3$ vertices let $\psi_{u2} = 4n/3$, and for the remaining $n/3$ let $\psi_{u3} = 4n/3$. Also for all
vertices let $\psi_{u(u+3)} = 4n/3$. All other vertex potentials are zero. The optimal solution is to assign the first three
values $n/3$ vertices each, yielding a score of $4n^2/3 + 3(\frac{n}{3})^2 = 5n^2/3$. The first $\alpha$-pass on value 1, where initially
a vertex $u$ is assigned its vertex optimal value $u + 3$, will assign the first $n/3$ vertices 1. This keeps the sum
of total vertex potential unchanged at $4n^2/3$, while the clique potential increases to $n^2/9 + 2n/3$, for a total score of
$4n^2/3 + n^2/9 + 2n/3 = 13n^2/9 + 2n/3$. No subsequent iterations with any other value can improve this score.
Thus, the score of $\alpha$-pass is $\frac{13}{15}$ of the optimal in the limit $n \to \infty$. □

6.2.1 $\alpha$-expansion

In general graphs, a popular method that provides the approximation guarantee of 1/2 for the Potts model is
the graph-cuts based $\alpha$-expansion algorithm [5]. We explore the behavior of this algorithm for Potts potentials.

In this scheme, we start with any initial assignment, for example one where all vertices are assigned the first value,
as suggested in [5]. Next, for each value $\alpha$ we perform an $\alpha$-expansion phase where we switch the assignment of
an optimal set of vertices to $\alpha$ from their current value. We repeat this until in a round over the $R$ values, no
vertices switch their assignment.

For graphs whose edge potentials form a metric, an optimal $\alpha$-expansion move is based on the use of the
mincut algorithm of [5], which for the case of cliques can be $O(n^3)$.

We next show how to perform optimal $\alpha$-expansion moves more efficiently for all kinds of SUM potentials.

An $\alpha$-expansion move. Let $\mathbf{v}$ be the assignment at the start of this move. For each value $v \neq \alpha$ create a
sorted list $S_v$ of vertices assigned $v$ in $\mathbf{v}$ in decreasing order of $\psi_{i\alpha} - \psi_{iv}$. If in an optimal move, we move $k_v$
vertices from $v$ to $\alpha$, then it is clear that we need to pick the top $k_v$ vertices from $S_v$. Let $r_i$ be the rank of a
vertex $i$ in $S_v$. Our remaining task is to decide the optimal number $k_v$ to take from each $S_v$. We find these using
dynamic programming. Without loss of generality assume $\alpha = R$ and $1, \ldots, R - 1$ are the $R - 1$ values other
than $\alpha$.

Let $D_j(k)$ denote the best score with $k$ vertices assigned values from $1 \ldots j$ switched to $\alpha$. We compute

$$D_j(k) = \max_{l \le k,\ l \le n_j(\mathbf{v})} D_{j-1}(k - l) + f_j(n_j(\mathbf{v}) - l) + \sum_{i' \in S_j:\ r_{i'} \le l} \psi_{i'\alpha} + \sum_{i' \in S_j:\ r_{i'} > l} \psi_{i'j}$$

From here we can calculate the optimal number of vertices to switch to $\alpha$ as $\arg\max_{k \le n - n_\alpha(\mathbf{v})} D_{R-1}(k) + f_\alpha(k + n_\alpha(\mathbf{v}))$.
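The dynamic program for a single move can be sketched in Python as below. This is an unoptimized illustration under several assumptions: the names `potts_term` and `alpha_expansion_move` are hypothetical, values are indexed from 0, the SUM potential is supplied as a per-value function `f(value, count)`, and partial vertex sums are recomputed rather than maintained incrementally.

```python
def potts_term(value, count):
    """Per-value term f_v(n_v) of the Potts potential with lambda = 1."""
    return count * count


def alpha_expansion_move(psi, assignment, alpha, f):
    """Return (score, new_assignment) after one optimal alpha-expansion move."""
    n, num_values = len(psi), len(psi[0])
    n_alpha = sum(1 for a in assignment if a == alpha)
    kept_alpha = sum(psi[i][alpha] for i in range(n) if assignment[i] == alpha)
    # table[k] = (best partial score, switched vertices) over the values processed so far.
    table = {0: (0, [])}
    for j in range(num_values):
        if j == alpha:
            continue
        # Vertices currently assigned j, best candidates for switching first.
        members = sorted((i for i in range(n) if assignment[i] == j),
                         key=lambda i: psi[i][alpha] - psi[i][j], reverse=True)
        new_table = {}
        for k_prev, (score_prev, moved_prev) in table.items():
            for l in range(len(members) + 1):
                vertex = (sum(psi[i][alpha] for i in members[:l])
                          + sum(psi[i][j] for i in members[l:]))
                s = score_prev + vertex + f(j, len(members) - l)
                k = k_prev + l
                if k not in new_table or s > new_table[k][0]:
                    new_table[k] = (s, moved_prev + members[:l])
        table = new_table
    # Add the contribution of alpha itself and pick the best number of switches.
    best_score, moved = max(
        ((s + kept_alpha + f(alpha, k + n_alpha), m) for k, (s, m) in table.items()),
        key=lambda item: item[0])
    new_assignment = list(assignment)
    for i in moved:
        new_assignment[i] = alpha
    return best_score, new_assignment
```

The full algorithm would call this move for each value in turn and stop once a whole round leaves the score unchanged.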

Theorem 6.6. The $\alpha$-expansion algorithm provides no better approximation guarantee than 1/2 even for the
special case of homogeneous Potts potential on cliques.

Proof. Consider an instance where $R = k + 1$ and $\lambda = 1$. Let $\psi_{u1} = 2n/k$ for all $u$, and for $k$ disjoint groups of
$n/k$ vertices each, let $\psi_{u,i+1} = 2n$ for the vertices in the $i$th group. All other vertex potentials are zero. Consider
the solution where every vertex is assigned value 1. This assignment is locally optimal wrt any $\alpha$-expansion
move, and its score is $n^2(1 + 2/k)$. However, the exact solution assigns every vertex group its value, with a score of
$n^2(2 + 1/k)$, thus giving a ratio of 1/2 in the limit. □

We next present a generalization of the $\alpha$-pass algorithm that provides provably better guarantees while being
faster than $\alpha$-expansion.

6.2.2 Generalized $\alpha$-pass algorithm

In $\alpha$-pass, for each value $\alpha$, we go over each count $k$ and find the best vertex score with $k$ vertices assigned value
$\alpha$. We generalize this to go over all value combinations of size no more than $q$, a parameter of the algorithm
that is fixed based on the desired approximation guarantee.

For each value subset $A \subseteq \mathcal{R}'_p$ of size no more than $q$, and for each count $k$, maximize vertex potentials with
$k$ vertices assigned a value from set $A$. For this, sort vertices in decreasing order of $\max_{\alpha \in A} \psi_{i\alpha} - \max_{v \notin A} \psi_{iv}$,
assign the top $k$ vertices their best value in $A$ and the remaining their best value not in $A$. The best solution
over all $A, k$ with $|A| \le q$ is the final assignment $\hat{\mathbf{v}}$.
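A minimal Python sketch of this generalization follows. It is illustrative only: the names are hypothetical, candidate assignments are rescored from scratch instead of incrementally, and, as a simplification of this sketch, only proper subsets $A$ are enumerated so that every vertex always has a best value outside $A$.

```python
from collections import Counter
from itertools import combinations


def potts(assignment, lam=1):
    """Potts potential: lam times the sum of squared value counts."""
    return lam * sum(c * c for c in Counter(assignment).values())


def generalized_alpha_pass(psi, clique_potential, q):
    """psi[j][v] is the potential of vertex j for value v; q bounds the subset size."""
    n, num_values = len(psi), len(psi[0])
    best_score, best = float("-inf"), None
    # Only proper subsets are tried (an assumption of this sketch).
    for size in range(1, min(q, num_values - 1) + 1):
        for subset in combinations(range(num_values), size):
            outside = [v for v in range(num_values) if v not in subset]
            inside_best = [max(subset, key=lambda v: psi[j][v]) for j in range(n)]
            outside_best = [max(outside, key=lambda v: psi[j][v]) for j in range(n)]
            # Vertices that gain the most from taking a value in the subset come first.
            order = sorted(range(n),
                           key=lambda j: psi[j][inside_best[j]] - psi[j][outside_best[j]],
                           reverse=True)
            for k in range(1, n + 1):
                assignment = list(outside_best)
                for j in order[:k]:
                    assignment[j] = inside_best[j]
                s = (sum(psi[j][v] for j, v in enumerate(assignment))
                     + clique_potential(assignment))
                if s > best_score:
                    best_score, best = s, assignment
    return best_score, best
```

With `q = 1` this reduces to the ordinary $\alpha$-pass sweep; larger `q` enumerates a superset of the candidates, so the returned score cannot decrease as `q` grows.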

The complexity of this algorithm is $O(nR^q \log n)$. In practice, we can use heuristics to prune the number of
value combinations. Further, we can make the following claims about the quality of its output.

Theorem 6.7. $F(\hat{\mathbf{v}}) \ge \frac{8}{9} F(\mathbf{v}^*)$.

Proof. This bound is achieved if we run the algorithm with $q = 2$. Let the optimal solution have counts
$n_1 \ge n_2 \ge \ldots \ge n_R$ and let its vertex potential be $\psi^*$. For simplicity let $a = n_1/n$, $b = n_2/n$ and $c = \psi^*/n^2$.
Then $F(\mathbf{v}^*)/n^2 \le c + a^2 + b(1 - a)$, $F(\hat{\mathbf{v}})/n^2 \ge c + a^2$ and $F(\hat{\mathbf{v}})/n^2 \ge c + \frac{(a+b)^2}{2}$.

Case 1: $a^2 \ge \frac{(a+b)^2}{2}$. Then $F(\mathbf{v}^*) - F(\hat{\mathbf{v}}) \le bn^2(1 - a)$. For a given value of $a$, this is maximized when $b$ is
as large as possible. For Case 1 to hold, the largest possible value of $b$ is given by $a^2 = \frac{(a+b)^2}{2}$, which gives
$b = a(\sqrt{2} - 1)$. Therefore $F(\mathbf{v}^*) - F(\hat{\mathbf{v}}) \le n^2 a(1-a)(\sqrt{2} - 1) \le \frac{n^2(\sqrt{2} - 1)}{4} < \frac{n^2}{8}$, i.e. $F(\hat{\mathbf{v}}) \ge \frac{8}{9} F(\mathbf{v}^*)$
(using $F(\hat{\mathbf{v}}) \ge n^2$, which holds because assigning all vertices one value already scores at least $n^2$).

Case 2: $a^2 < \frac{(a+b)^2}{2}$. This holds if $b > (\sqrt{2} - 1)a$. Since $a + b \le 1$, this is possible only if $a < 1/\sqrt{2}$. Now

$$\frac{F(\mathbf{v}^*) - F(\hat{\mathbf{v}})}{n^2} \le a^2 + b(1 - a) - \frac{(a + b)^2}{2} = \frac{a^2 - 4ab + 2b - b^2}{2}.$$

For a given $a$, this expression is quadratic in $b$ with a negative second derivative. This is maximized (by
differentiating) for $b = 1 - 2a$. Since $b \le a$, this value is possible only if $a \ge 1/3$. Similarly, for Case 2 to hold
with this value of $b$, we must have $a < \sqrt{2} - 1$. Substituting this value of $b$, the difference in scores is $\frac{5a^2 - 4a + 1}{2}$.

Since this is quadratic with a positive second derivative, it is maximized when $a$ has either the minimum or
maximum possible value. For $a = 1/3$ this value is $1/9$, while for $a = \sqrt{2} - 1$, it is $10 - 7\sqrt{2}$. In both cases, it is
less than $1/8$.

If $a < 1/3$ the maximum is achieved when $b = a$. In this case, the score difference is at most $(a - 2a^2)$, which
is maximized for $a = 1/4$, where the value is $1/8$. (This is the worst case.)

For $\sqrt{2} - 1 < a < 1/\sqrt{2}$, the maximum will occur for $b = (\sqrt{2} - 1)a$. Substituting this value for $b$, the score
difference is $(\sqrt{2} - 1)(a - a^2)$, which is maximized for $a = 1/2$, where its value is $(\sqrt{2} - 1)/4 < 1/8$. □
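The case analysis can be sanity-checked numerically. The sketch below evaluates, on a grid of $(a, b)$ pairs with $b \le a$ and $a + b \le 1$, the gap between the upper bound on $F(\mathbf{v}^*)/n^2$ and the better of the two lower bounds on $F(\hat{\mathbf{v}})/n^2$ used in the proof (the common term $c$ cancels). The function names and the grid resolution are choices made for this illustration; a grid search is not a proof.

```python
def score_gap_bound(a, b):
    """Upper bound on (F(v*) - F(v_hat)) / n^2 implied by the proof's three inequalities."""
    upper_opt = a * a + b * (1 - a)
    lower_alg = max(a * a, (a + b) ** 2 / 2)
    return upper_opt - lower_alg


def worst_case_on_grid(steps):
    """Return (gap, a, b) maximizing the gap over a grid with b <= a and a + b <= 1."""
    best = (float("-inf"), None, None)
    for i in range(steps + 1):
        a = i / steps
        for j in range(i + 1):  # enforces b <= a
            b = j / steps
            if a + b <= 1:
                gap = score_gap_bound(a, b)
                if gap > best[0]:
                    best = (gap, a, b)
    return best
```

On a grid with 100 steps the largest gap is attained at $a = b = 1/4$, matching the worst case identified in the proof.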

We believe that the bound for general $q$ is $\frac{4q}{4q+1}$. This bound is not tight, as for $q = 1$ we have already shown
that the $\frac{4}{5}$ bound can be tightened to $\frac{13}{15}$. With $q = 2$ we get a bound of $\frac{8}{9}$, which is better than $\frac{13}{15}$.

Entropy potentials and the $\alpha$-pass algorithm

As an aside, let us explore the behavior of $\alpha$-pass on another family of additive potentials: entropy potentials.
Entropy potentials are of the form:

$$C(\mathbf{v}) = \lambda \sum_v n_v \log n_v, \text{ where } \lambda > 0 \qquad (27)$$

The main reason $\alpha$-pass provides a good bound for Potts potentials is that it guarantees a clique potential of
at least $n_1^2$, where $n_1$ is the count of the most dominant value in the optimal solution. The quadratic term
compensates for possible sub-optimality of counts of other values. If we had a sub-quadratic term instead, say
$n_1 \log n_1$ for the entropy potentials, the same bound would not have held. In fact the following theorem shows
that for entropy potentials, even though $\alpha$-pass guarantees a clique potential of at least $n_1 \log n_1$, that is not
enough to provide a good approximation ratio.

Theorem 6.8. $\alpha$-pass does not provide a bound better than $\frac{1}{2}$ for entropy potentials.

Proof. Consider a counter example where there are $R = n + \log n$ values. Divide the values into two sets: $A$
with $\log n$ values and $B$ with $n$ values. The vertex potentials are as follows: the vertices are divided into $\log n$
chunks of size $n/\log n$ each. If the $j$th vertex lies in the $v$th chunk, then let it have a vertex potential of $\log n$
with value $v$ in $A$ and a vertex potential of $\log n + \epsilon$ with the $j$th value in $B$. Let all other vertex potentials be
zero. Also, let $\lambda = 1$.

Consider the assignment which assigns the $v$th value in $A$ to the $v$th chunk. Its score is $2n \log n - n \log\log n$.
Now consider $\alpha$-pass, with $\alpha \in A$. Initially vertex $j$ will be set to the $j$th value in $B$. The best assignment found
by $\alpha$-pass will assign every vertex to $\alpha$, for a total score of roughly $n + n \log n$. If $\alpha \in B$, then again the best
assignment will assign everything to $\alpha$, for a total score of roughly $(n + 1) \log n$.

Thus the bound is no better than $\frac{1}{2}$ as $n \to \infty$. □

Thus, $\alpha$-pass provides good approximations when the clique potential is dominated by the most dominant
value. We now look at majority potentials, which are linear in the counts $\{n_v\}_v$. Looking at Theorem 6.8,
we expect that $\alpha$-pass will not have decent approximation guarantees for majority. This is indeed the case.
We will prove in Section 6.3 that neither $\alpha$-pass nor a natural modification of $\alpha$-pass enjoys good approximation
guarantees.

6.3 Clique Inference for MAJORITY Potentials

Recall that MAJORITY potentials are of the form $C = f_a(\mathbf{n})$, $a = \arg\max_v n_v$. We consider linear
majority potentials where $f_a(\mathbf{n}) = \sum_v w_{av} n_v$. The matrix $W = \{w_{vv'}\}$ is not necessarily diagonally dominant or
symmetric.

We show that exact MAP for linear majority potentials can be found in polynomial time. We also present
a modification to the $\alpha$-pass algorithm to serve as an efficient heuristic, but without approximation guarantees.
Then we present a Lagrangian relaxation based approximation, whose runtime in practice is similar to $\alpha$-pass,
but provides much better solutions.



6.3.1 Modified $\alpha$-pass algorithm

In the case of linear majority potentials, we can incorporate the clique term in the vertex potential, and this
leads to the following modifications to the $\alpha$-pass algorithm: (a) Sort the list for $\alpha$ according to the modified
metric $\psi_{i\alpha} + w_{\alpha\alpha} - \max_{v \neq \alpha}(\psi_{iv} + w_{\alpha v})$, and (b) While sweeping the list for $\alpha$, discard all candidate solutions
whose majority value is not $\alpha$.

However, even after these modifications, $\alpha$-pass does not provide the same approximation guarantee as for
homogeneous Potts potentials, as we prove next. We call a matrix $W$ diagonally dominant iff each of its
diagonal entries is the largest in its corresponding row.

Theorem 6.9. The modified $\alpha$-pass algorithm cannot have an approximation ratio better than $\frac{1}{2}$ on linear
majority potentials with unconstrained $W$.

Proof. Consider the degenerate example where all vertex potentials are zero. Let $\beta$ and $\gamma$ be two fixed values
and let the matrix $W$ be defined as follows: $w_{\beta\gamma} = M + \epsilon$, $w_{\beta v} = M\ \forall v \neq \beta, \gamma$ and all the other entries in $W$
are zero.

In modified $\alpha$-pass, when $\alpha \neq \beta$ the assignment returned will have a zero score. When $\alpha = \beta$, all vertices
will prefer the value $\gamma$, so $\alpha$-pass will have to assign exactly $n/2$ vertices as $\beta$ to make it the majority value, thus
returning a score of $\frac{(M + \epsilon)n}{2}$. However, consider the assignment which assigns $n/R$ vertices to each value, with a
score of $(R - 1)Mn/R$. Hence the approximation ratio cannot be better than $\frac{1}{2}$. □

Theorem 6.10. The modified $\alpha$-pass algorithm cannot provide an approximation bound better than $\frac{2}{3}$ for linear
majority potentials even when $W$ is diagonally dominant and each of its rows has the same sum.

Proof. The proof is by counter example, which is constructed as follows. Let the set of values $\mathcal{R}'_p$ be divided
into two subsets, $A$ and $B$, with $k$ and $n - k$ values respectively, where $k < n/2$. Let $\beta \in B$ be a fixed value. We
define $W$ as:

$$w_{vv'} = \begin{cases}
n - k & v, v' \in A \\
n - k & v \in A,\ v' = \beta \\
k + 1 & v, v' \in B \\
0 & \text{otherwise}
\end{cases}$$

Thus $W$ is diagonally dominant with all rows summing to $(n - k)(k + 1)$.

The vertex potentials $\psi$ are defined as follows. Divide the vertices into $k$ chunks of size $n/k$ each. For the
$v$th value in $A$, each vertex in the $v$th chunk has a vertex potential of $2(n - k)$. Further, $\psi_{i\beta} = 2(n - k)\ \forall i$. The
remaining vertex potentials are zero.

The optimal solution is obtained by assigning the $v$th value in $A$ to the $v$th chunk, with a total score of
$\frac{n}{k} \cdot 2(n - k) \cdot k + (n - k) \cdot \frac{n}{k} \cdot k = 3n(n - k)$.

In $\alpha$-pass, consider the pass for $\alpha \in A$. Each vertex $i$ prefers $\beta$ because $\psi_{i\beta} + w_{\alpha\beta} = 3(n - k)$ is the best
across all values. Thus, the best $\alpha$-pass generated assignment with majority value $\alpha$ is one where we assign $n/2$
vertices $\alpha$, including the $n/k$ vertices that correspond to the chunk of $\alpha$. The vertex and clique potentials of this
assignment are $\frac{n}{k} \cdot 2(n - k) + \frac{n}{2} \cdot 2(n - k)$ and $\frac{n}{2}(n - k) + \frac{n}{2}(n - k)$, giving a total score of $2n(n - k) + \frac{2n(n-k)}{k}$.
This gives an approximation ratio of $\frac{2}{3}(1 + \frac{1}{k})$.

Now consider the pass when $\alpha \in B$. Each vertex $i$ again prefers $\beta$ because $\psi_{i\beta} + w_{\alpha\beta} = 2n - k + 1$ is the best
across all values. If $\alpha = \beta$, the best $\alpha$-pass assignment is one where all vertices are assigned $\beta$, giving a total
score of $n(2n - k + 1)$. If $\alpha \neq \beta$, then to make $\alpha$ the majority value, $\alpha$-pass can only output an assignment with
score less than $n(2n - k + 1)$. In this case, the approximation ratio is no better than $\frac{2n - k + 1}{3(n - k)}$.

Setting $k = \sqrt{n}$ and $n \to \infty$, we get the desired result. □

However, in practice, where the $W$ matrix is typically sparse, our experiments in Section 7.1 show that $\alpha$-pass
performs well and is significantly more efficient than the exact algorithm described next.



6.3.2 Exact Algorithm 



Since majority potentials are linear, we can pose the optimization problem in terms of Integer Programs (IPs).
Assume that we know the majority value $\alpha$. Then, the optimization problem corresponds to the IP:

$$\begin{aligned}
\max_{\mathbf{z}}\ & \sum_{i,v} (\psi_{iv} + w_{\alpha v}) z_{iv} \\
\text{s.t.}\ & \forall v: \sum_i z_{iv} \le \sum_i z_{i\alpha}, \\
& \forall i: \sum_v z_{iv} = 1, \quad z_{iv} \in \{0, 1\}
\end{aligned} \qquad (28)$$

We can solve $R$ such IPs by guessing various values as the majority value, and reporting the best overall
assignment as the output. However, Equation 28 cannot be relaxed to a linear program. This can be easily
shown by proving that the constraint matrix is not totally unimodular. Alternatively, here is a counter example:
Consider a 3-node, 3-value graphical model with a zero $W$ matrix. Let the vertex potential vectors be $\psi_0 = (1, 4, 0)$,
$\psi_1 = (4, 0, 4)$, $\psi_2 = (3, 4, 0)$. While solving for $\alpha = 0$, the best IP assignment is $1, 0, 0$ with a score of
11. However, the LP relaxation has the solution $\mathbf{z} = (0, 1, 0;\ 1, 0, 0;\ 1/2, 1/2, 0)$ with a score of 11.5.
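The counter example is small enough to check by enumeration. The following Python sketch does so under the stated setup (zero $W$, majority value 0); the helper names are hypothetical, and the script only verifies that the quoted fractional point is feasible and scores higher than the best integral assignment, without solving the linear program itself.

```python
from fractions import Fraction
from itertools import product

PSI = [(1, 4, 0), (4, 0, 4), (3, 4, 0)]  # vertex potentials from the text
ALPHA = 0                                 # guessed majority value
HALF = Fraction(1, 2)
Z_LP = [(0, 1, 0), (1, 0, 0), (HALF, HALF, 0)]  # fractional solution from the text


def column_sums(z):
    return [sum(row[v] for row in z) for v in range(len(z[0]))]


def is_feasible(z, alpha):
    """Constraints of Equation 28 with integrality relaxed to 0 <= z_iv <= 1."""
    sums = column_sums(z)
    rows_ok = all(sum(row) == 1 and all(0 <= x <= 1 for x in row) for row in z)
    return rows_ok and all(s <= sums[alpha] for s in sums)


def objective(psi, z):
    # W is the zero matrix in this example, so only vertex potentials contribute.
    return sum(p * x for prow, zrow in zip(psi, z) for p, x in zip(prow, zrow))


def best_integral(psi, alpha):
    """Enumerate all 0/1 assignments and return (score, assignment) of the best feasible one."""
    best = None
    num_values = len(psi[0])
    for assignment in product(range(num_values), repeat=len(psi)):
        z = [tuple(int(v == a) for v in range(num_values)) for a in assignment]
        if is_feasible(z, alpha):
            value = objective(psi, z)
            if best is None or value > best[0]:
                best = (value, assignment)
    return best
```

Here `best_integral(PSI, ALPHA)` recovers the integral optimum of 11, while `objective(PSI, Z_LP)` evaluates to 23/2 for a point that satisfies the relaxed constraints, so the relaxation is not tight.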

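This issue can be resolved by making the constraint matrix totally unimodular as follows. Guess the majority
value $\alpha$ and its count $k = n_\alpha$, and solve the following IP:

$$
\begin{aligned}
\max_{z}\ & \sum_{i,v} (\psi_{iv} + w_{\alpha v})\, z_{iv} \\
\text{s.t.}\ & \forall v \neq \alpha: \sum_i z_{iv} \le k, \\
& \sum_i z_{i\alpha} = k, \\
& \forall i: \sum_v z_{iv} = 1, \quad z_{iv} \in \{0, 1\}
\end{aligned}
\tag{29}
$$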



This IP solves the degree constrained bipartite matching problem, which can be solved exactly in polynomial 
time. Indeed, it can be shown that the LP relaxation of this IP has an integral solution. 

Theorem 6.11. The integer program in Equation 29 has a tight LP relaxation.

Proof. Denote the constraint matrix of program (29) by $A_{(m+n) \times mn}$, where $m = R$ is the number of values, and
let $A_1$ and $A_2$ denote its first $m - 1$ and last $n + 1$ rows respectively. The $n + 1$ equality constraints can be
converted into '$\le$' constraints by adding negative slack variables. For example, $s_i + \sum_v z_{iv} \le 1$ and
$s_\alpha + \sum_i z_{i\alpha} \le k$. The variables are now $(z, s)^T$, and the extra constraints are $s \le 0$. The new
constraint matrix of this system (which has only inequality constraints) is given by

$$
B = \begin{bmatrix} A_1 & 0 \\ A_2 & I \\ 0 & I \end{bmatrix}.
$$

The tightness of the LP relaxation follows if $B$ is totally unimodular. For that, it suffices to prove that
$A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$ is totally unimodular. This is so because then
$\begin{bmatrix} A_1 & 0 \\ A_2 & I \end{bmatrix}$ would be totally unimodular, and by extension of the same argument, so
would be $B$. The total unimodularity of $A$ is proven as follows.






Let $C$ be an arbitrary $t \times t$ sub-matrix of $A$. Our argument uses induction on $t$, with the base case $t = 1$
being straightforward. Note that each column in $A$ has exactly two 1's. Let $B_1$ denote the first $m$ rows and $B_2$
denote the remaining $n$ rows of $A$.

Case 1: $C$ has a column with all zeros. Then $\det(C) = 0$ and we are done.

Case 2: $C$ lies totally inside either $B_1$ or $B_2$. Since there is only one non-zero entry in each column of $B_1$ (or
$B_2$), pick any column and $\det(C)$ will be $\pm 1$ times the determinant of its $(t - 1) \times (t - 1)$ sub-matrix, depending
on the column index. So using the induction hypothesis, we get $\det(C) \in \{0, \pm 1\}$.

Case 3: $C$ spans rows in $B_1$ and $B_2$. Wlog assume that each column in $C$ has exactly two 1's, otherwise
we can apply the same argument as Case 1 or 2. Now, summing up the rows corresponding to $B_1$ and $B_2$
separately, we get $\forall j: \sum_{i \in \mathrm{rows}(B_1)} c_{ij} = 1 = \sum_{i \in \mathrm{rows}(B_2)} c_{ij}$. Hence the rows of $C$ are linearly dependent and
so $\det(C) = 0$. □
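The structure used in the proof (every column of $A$ has one 1 in a value row and one 1 in a vertex row) can be checked numerically on tiny instances. The sketch below is only a brute-force sanity check under that assumed row layout, with hypothetical helper names; it is exponential in the matrix size and is no substitute for the proof.

```python
from itertools import combinations


def constraint_matrix(n, m):
    """Rows 0..m-1: one per value v; rows m..m+n-1: one per vertex i.
    Column (i, v) has a 1 in row v and a 1 in row m + i."""
    cols = [(i, v) for i in range(n) for v in range(m)]
    A = [[0] * len(cols) for _ in range(m + n)]
    for c, (i, v) in enumerate(cols):
        A[v][c] = 1
        A[m + i][c] = 1
    return A


def det(M):
    """Integer determinant by Laplace expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        if entry:
            minor = [row[:j] + row[j + 1:] for row in M[1:]]
            total += (-1) ** j * entry * det(minor)
    return total


def is_totally_unimodular(A):
    """Check that every square sub-matrix has determinant in {-1, 0, 1}."""
    rows, cols = len(A), len(A[0])
    for t in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), t):
            for c in combinations(range(cols), t):
                sub = [[A[i][j] for j in c] for i in r]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True
```

For example, `is_totally_unimodular(constraint_matrix(2, 3))` inspects all 461 square sub-matrices of the 5 x 6 instance, whereas the odd-cycle matrix `[[1, 1, 0], [0, 1, 1], [1, 0, 1]]` (determinant 2) is rejected.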

Thus we can solve $O(Rn)$ such problems by varying $\alpha$ and $k$, and report the best solution. We believe
that since the subproblems are related, it should be possible to solve them incrementally using combinatorial
approaches.



6.3.3 Lagrangian Relaxation based Algorithm for majority Potentials 

Solving the linear system in Equation 29 is very expensive because we need to solve $O(Rn)$ LPs, whereas the
system in Equation 28 cannot be solved exactly using a linear relaxation. Here, we look at a Lagrangian
Relaxation based approach, where we solve the system in Equation 28 but bypass the troublesome constraint
$\forall v: \sum_i z_{iv} \le \sum_i z_{i\alpha}$.

We make use of the Lagrangian Relaxation technique to move the troublesome majority constraint to the
objective function. Any violation of this constraint is penalized by a positive penalty term. Consider the following
modified program, also called the Lagrangian:

$$
\begin{aligned}
L(\gamma) = L(\gamma_1, \ldots, \gamma_R) = \max_{z}\ & \sum_{i,v} (\psi_{iv} + w_{\alpha v})\, z_{iv} + \sum_v \gamma_v \Big( \sum_i z_{i\alpha} - \sum_i z_{iv} \Big) \\
\text{s.t.}\ & \forall i: \sum_v z_{iv} = 1, \quad z_{iv} \in \{0, 1\}
\end{aligned}
\tag{30}
$$

For $\gamma \ge 0$ and feasible assignments $z$, $L(\gamma)$ is an upper bound for our objective in Equation 28. Thus, we
compute the lowest such upper bound:

$$
L^* = \min_{\gamma \ge 0} L(\gamma) \tag{31}
$$
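The upper-bound property can be spelled out in two steps. In this sketch, $F(z)$ is shorthand (borrowed from Algorithm 2) for the objective of Equation 28, and $z$ is any assignment that satisfies the majority constraint:

```latex
\begin{align*}
F(z) &= \sum_{i,v} (\psi_{iv} + w_{\alpha v})\, z_{iv} \\
     &\le \sum_{i,v} (\psi_{iv} + w_{\alpha v})\, z_{iv}
        + \sum_v \gamma_v \Big( \sum_i z_{i\alpha} - \sum_i z_{iv} \Big)
        && \text{since } \gamma_v \ge 0 \text{ and } \sum_i z_{i\alpha} \ge \sum_i z_{iv} \\
     &\le L(\gamma)
        && \text{since } L(\gamma) \text{ maximizes over all assignments, feasible or not.}
\end{align*}
```

Taking the maximum over feasible $z$ on the left and the minimum over $\gamma \ge 0$ on the right shows that $L^*$ is at least the optimum of Equation 28.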

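Since the penalty term in Equation 30 is linear in $z$, we can merge it with the first term to get another set
of modified vertex potentials:

$$
\psi^\alpha_{iv} = \psi_{iv} + w_{\alpha v} - \gamma_v +
\begin{cases}
\sum_{v'} \gamma_{v'} & v = \alpha \\
0 & v \neq \alpha
\end{cases}
\tag{32}
$$

Equation 30 can now be rewritten in terms of $\psi^\alpha$, with the only constraint that $z$ be an assignment:

$$
\begin{aligned}
\max_{z}\ & \sum_{i,v} \psi^\alpha_{iv}\, z_{iv} \\
\text{s.t.}\ & \forall i: \sum_v z_{iv} = 1, \quad z_{iv} \in \{0, 1\}
\end{aligned}
\tag{33}
$$

Hence, $L(\gamma)$ can be computed by independently assigning each vertex $i$ to its best value, viz. $\arg\max_v \psi^\alpha_{iv}$.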
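Evaluating $L(\gamma)$ therefore reduces to a vertex-wise argmax over the modified potentials of Equation 32. A minimal sketch follows; the function names and the list-of-lists layout (`psi[i][v]`, with `w_alpha[v]` standing for $w_{\alpha v}$) are assumptions made for illustration, and ties are broken in favor of the lowest value index.

```python
def modified_potentials(psi, w_alpha, gamma, alpha):
    """Equation 32: psi_iv + w_alpha[v] - gamma[v], plus sum(gamma) when v == alpha."""
    total = sum(gamma)
    return [
        [row[v] + w_alpha[v] - gamma[v] + (total if v == alpha else 0.0)
         for v in range(len(row))]
        for row in psi
    ]


def lagrangian(psi, w_alpha, gamma, alpha):
    """Return (L(gamma), labels), giving each vertex its best modified potential."""
    mod = modified_potentials(psi, w_alpha, gamma, alpha)
    labels = [max(range(len(row)), key=row.__getitem__) for row in mod]
    value = sum(row[v] for row, v in zip(mod, labels))
    return value, labels
```

On the three-vertex counterexample of Section 6.3.2 (zero $W$, $\alpha = 0$), $\gamma = 0$ gives the bound 12 and $\gamma = (0, 0.5, 0)$ gives 11.5, both above the best integral score of 11.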

We now focus on computing $L^*$. We use an iterative approach, beginning with $\gamma = 0$, and carefully choose
a new $\gamma$ at each step to get a non-increasing sequence of $L(\gamma)$'s. We describe the method of choosing a new $\gamma$
later in this section, and first outline sufficient conditions for termination and detection of optimality.

Theorem 6.12. $z^*$ and $\gamma^*$ are optimum solutions to Equations 28 and 31 respectively if they satisfy the conditions:

$$
\forall v: \sum_i z^*_{i\alpha} \ge \sum_i z^*_{iv} \tag{34}
$$

$$
\forall v: \Big| \gamma^*_v \Big( \sum_i z^*_{iv} - \sum_i z^*_{i\alpha} \Big) \Big| = 0 \tag{35}
$$

The conditions of Theorem 6.12 can, in general, be met only by a fractional $z^*$. To see why, consider an example with
3 vertices and 2 values. Let $\psi_{i0} + w_{\alpha 0} > \psi_{i1} + w_{\alpha 1}$ for all $i$ and $\alpha$. During Lagrangian relaxation with
$\alpha = 1$, the initial $\gamma = 0$ will cause all vertices to be assigned value 0, violating Equation 34. Since the count
difference $\sum_i z_{i0} - \sum_i z_{i1} \in \{\pm 1, \pm 3\}$ is never zero, any non-zero $\gamma_0$ will violate Equation 35. Subsequent
reduction of $\gamma_0$ to zero will again cause the original violation of Equation 34. Consequently, one of Equations 34
and 35 will never be satisfied and the algorithm will oscillate.

To tackle this, we relax Equation 35 to $|\gamma_v (\sum_i z_{iv} - \sum_i z_{i\alpha})| \le \epsilon$, where $\epsilon$ is a small fraction of an upper
bound on $\gamma_v$, whose computation is illustrated later. This helps in reporting assignments that respect the
majority constraint in Equation 34 and are close to the optimal.

The outline of the algorithm is given in Algorithm 2. We now discuss a few possible approaches to select a
new $\gamma$ at every step.



Subgradient Optimization 

This approach can be used to change all components of $\gamma$ in a single step. Subgradient optimization generates
a sequence of direction vectors $\{d^1, d^2, \ldots\}$ and positive step sizes $\{\eta_1, \eta_2, \ldots\}$. At the $k$-th step, the $\gamma$ vector is
changed as:

$$
\gamma_v \leftarrow \max(0, \gamma_v + \eta_k d^k_v) \tag{36}
$$

In its simplest form, the direction $d^k$ is the violation vector $(\sum_i z_{iv} - \sum_i z_{i\alpha})_{v = 1 \ldots R}$. Thus, if a value $v$ has a
count greater than that of $\alpha$, then $\gamma_v$ will be increased to take some vertices away from $v$. In practice though, $d^k$ is
usually a convex combination of the violation vector and the previous direction $d^{k-1}$. This helps in avoiding
oscillations of the kind where $\gamma^{k+2}$ is very close to $\gamma^k$, while simultaneously moving closer to the optimum.

The subgradient optimization framework allows various ways to choose the step sizes. For example, if we
choose to set the direction using only the violation vector, then a sequence $\{\eta_1, \eta_2, \ldots\}$ of step sizes satisfying
(i) $\lim_{k \to \infty} \eta_k = 0$ and (ii) $\sum_{k=1}^{\infty} \eta_k = \infty$ will ensure asymptotic convergence [11]. These are not the only set of
sufficient conditions that guarantee convergence. Practical implementations often compute $\eta_k$ using the current
value of $L(\gamma)$, current violations, and a few user defined parameters.
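The update of Equation 36 is short enough to sketch directly. The helper names below are illustrative, the harmonic schedule $\eta_k = \eta_0 / k$ is just one choice that meets conditions (i) and (ii), and the mixing weight in `smoothed_direction` is an arbitrary placeholder rather than a value taken from the experiments.

```python
def violation_vector(labels, num_values, alpha):
    """d_v = (#vertices labeled v) - (#vertices labeled alpha)."""
    counts = [0] * num_values
    for v in labels:
        counts[v] += 1
    return [counts[v] - counts[alpha] for v in range(num_values)]


def subgradient_step(gamma, direction, eta):
    """Equation 36: gamma_v <- max(0, gamma_v + eta * d_v)."""
    return [max(0.0, g + eta * d) for g, d in zip(gamma, direction)]


def smoothed_direction(violation, previous, mix=0.7):
    """Convex combination of the current violation and the previous direction."""
    return [mix * a + (1.0 - mix) * b for a, b in zip(violation, previous)]


def harmonic_step(k, eta0=1.0):
    """eta_k = eta0 / k: tends to zero while the partial sums diverge."""
    return eta0 / k
```

With labels `[1, 0, 1]` and $\alpha = 0$, value 1 is over-represented by one vertex, so only $\gamma_1$ grows; the negative component for value 2 is clipped at zero.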

However, during experimentation, the degrees of freedom in choosing the step sizes posed a big problem for
us. Since a single step size is shared across all components of $\gamma$, a large step size moved everything to $\alpha$, while a
small step size considerably slowed down convergence. Data independent approaches to choosing step sizes also
failed to converge in a reasonable number of iterations. In general, subgradient optimization is known to require
very careful tweaking of the step sizes across iterations in order to achieve meaningful convergence speeds [11].
For this reason, we looked at alternate approaches to change $\gamma$.



Golden Search based Coordinate Descent 

If all components of $\gamma$ except one (say $\gamma_v$) are kept fixed, then $L(\gamma)$ is a quasi-convex function of $\gamma_v$. Thus, it
has a unique global minimum, which can be found using golden search, an efficient line search method.
We choose the value $v$ whose corresponding violation is the highest in magnitude.

Golden search requires lower and upper bounds on $\gamma_v$ and evaluates $L(\gamma)$ at various $\gamma$'s inside that interval.
As before, $L(\gamma)$ can be easily obtained by computing an assignment which is vertex-optimal wrt the $\psi^\alpha$'s.
We use the trivial lower bound of zero, and estimate a good upper bound from the current solution state. If
currently $\sum_i z_{iv} < \sum_i z_{i\alpha}$, then $\gamma_v$ (which is a penalty parameter) can be decreased, and therefore the current
value of $\gamma_v$ can serve as an upper bound. On the other hand, if we start increasing $\gamma_v$, then one by one, the
vertices currently assigned $v$ will switch to their next best values, and by a particular increased value of $\gamma_v$, all
vertices assigned $v$ would have flipped. There is no need to increase $\gamma_v$ beyond this point, so we use this value
of $\gamma_v$ as our upper bound, which can thus be summarized as:

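$$
UB(v) = \max_{i:\, z_{iv} = 1} \delta_i
$$

where (denoting the current second best value of $i$ by $\beta$),

$$
\delta_i =
\begin{cases}
\psi_{iv} + w_{\alpha v} - \psi_{i\beta} - w_{\alpha\beta} + \gamma_\beta & \beta \neq \alpha \\
\frac{1}{2} \Big( \psi_{iv} + w_{\alpha v} - \psi_{i\alpha} - w_{\alpha\alpha} - \sum_{v' \neq v, \alpha} \gamma_{v'} \Big) & \beta = \alpha
\end{cases}
$$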



In spite of its simplicity, Golden Search suffers from a few drawbacks that come to light during experimentation.
First, it always performs $\Theta(\log UB(v))$ evaluations of $L(\gamma)$. This can drive up the overall runtime of the
algorithm. Second, a change in $\gamma_v$ affects the modified vertex potentials for values $v$ and $\alpha$ (Equation 32). Thus,
a large change in $\gamma_v$ may flip many vertices to $\alpha$, causing a big change in the current assignment $z$, and we
may end up spending the next few iterations repairing these changes. Third, line search methods zero in on the
optimum by evaluating the objective at various points and choosing a sub-interval accordingly. In our case, due
to the integrality of $z$, ties can happen at many places in an interval. In such a scenario, arbitrary tie resolution
may cause a wrong sub-interval to be chosen for further consideration.
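For reference, a textbook golden-section search looks as follows. This is a generic sketch rather than the implementation used in the experiments: `f` stands in for the map $\gamma_v \mapsto L(\gamma)$ with all other components fixed, `lo` and `hi` play the roles of the zero lower bound and $UB(v)$, and the tolerance is an assumed parameter. The returned evaluation count makes the logarithmic cost visible.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # 1 / golden ratio, about 0.618


def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Shrink [lo, hi] around a minimizer of a unimodal f; return (x, evaluations)."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    evaluations = 2
    while hi - lo > tol:
        if fc <= fd:
            # The minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:
            # The minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
        evaluations += 1
    return (lo + hi) / 2.0, evaluations
```

The `fc <= fd` comparison is where the tie problem described above enters: when the two probes return equal values, the search keeps the left sub-interval arbitrarily.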

To tackle these issues, we use a more conservative coordinate descent approach, which we describe next and 
also use in our experiments. 

Conservative Coordinate Descent 

We can avoid a large number of flips in the current assignment if we replace our golden search method with a
more conservative one. Let $v$ be the worst violating value in the current iteration. We will first consider the case
when its count exceeds that of $\alpha$, so that Equation 34 does not hold.

To decrease the count of $v$, we need to increase $\gamma_v$. Let $i$ be a vertex currently assigned $v$ and let $\beta(i)$ be
its second most preferred value under the vertex potentials $\psi^\alpha$. The vertex $j = \arg\max_{i:\, z_{iv} = 1} \psi^\alpha_{i\beta(i)} - \psi^\alpha_{iv}$ is the
easiest to flip. So we increase $\gamma_v$ till the point when this difference becomes zero. The new value of $\gamma_v$ is therefore
given by:

$$
\gamma_v = \min_{i:\, z_{iv} = 1}
\begin{cases}
\Delta\psi(i, v, \beta(i)) + \gamma_{\beta(i)} & \beta(i) \neq \alpha \\
\frac{1}{2} \Big( \Delta\psi(i, v, \alpha) - \sum_{v' \neq v, \alpha} \gamma_{v'} \Big) & \beta(i) = \alpha
\end{cases}
\tag{37}
$$

where $\Delta\psi(i, v, v')$ denotes $\psi_{iv} + w_{\alpha v} - \psi_{iv'} - w_{\alpha v'}$. It is possible that by flipping vertex $j$, $\beta(j)$ now violates
Equation 34. Further, increasing $\gamma_v$ also increases $\psi^\alpha_{i\alpha}$, so some other vertices that are not assigned $v$ may also
move to $\alpha$. However, since the change is conservative, we expect this behavior to be limited. In our experiments,
we found that this conservative scheme converges much faster than golden section over a variety of data.

We now look at the case when Equation 34 is satisfied by all values but Equation 35 is violated by some
value $v$. In this scenario, we need to decrease $\gamma_v$ to decrease the magnitude of the violation. Here too, we
conservatively decrease $\gamma_v$ barely enough to flip one vertex to $v$. If $i$ is any vertex not assigned value $v$ and $\beta(i)$
is its current value, then the new value of $\gamma_v$ is given by:

$$
\gamma_v = \max_{i:\, z_{iv} = 0}
\begin{cases}
\Delta\psi(i, v, \beta(i)) + \gamma_{\beta(i)} & \beta(i) \neq \alpha \\
\frac{1}{2} \Big( \Delta\psi(i, v, \alpha) - \sum_{v' \neq v, \alpha} \gamma_{v'} \Big) & \beta(i) = \alpha
\end{cases}
\tag{38}
$$

Note that the arguments of Equations 37 and 38 are the same. In this case too, in spite of a conservative move,
more than one vertex marked $\alpha$ may flip to some other value, although at most one of them will be flipped to $v$.
As before, the small magnitude of the change restricts this behavior in practice.
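The increasing case can be sketched by solving, for each vertex labeled $v$, the indifference condition $\psi^\alpha_{iv} = \psi^\alpha_{i\beta(i)}$ for $\gamma_v$ using the modified potentials of Equation 32. The code below is an illustrative reading of that step, not the authors' implementation; all function names are hypothetical, and the runner-up $\beta(i)$ is taken under the current $\gamma$.

```python
def modified_row(psi_i, w_alpha, gamma, alpha):
    """Modified potentials of one vertex (Equation 32)."""
    total = sum(gamma)
    return [psi_i[u] + w_alpha[u] - gamma[u] + (total if u == alpha else 0.0)
            for u in range(len(psi_i))]


def flip_threshold(psi_i, w_alpha, gamma, v, beta, alpha):
    """Value of gamma[v] at which the vertex is indifferent between v and beta."""
    delta = (psi_i[v] + w_alpha[v]) - (psi_i[beta] + w_alpha[beta])
    if beta != alpha:
        return delta + gamma[beta]
    # Raising gamma[v] lowers the potential of v and raises that of alpha,
    # so the gap closes twice as fast.
    others = sum(g for u, g in enumerate(gamma) if u not in (v, alpha))
    return 0.5 * (delta - others)


def conservative_increase(psi, w_alpha, gamma, labels, v, alpha):
    """Smallest gamma[v] that makes one vertex labeled v indifferent to its runner-up."""
    thresholds = []
    for psi_i, label in zip(psi, labels):
        if label != v:
            continue
        mod = modified_row(psi_i, w_alpha, gamma, alpha)
        beta = max((u for u in range(len(mod)) if u != v), key=mod.__getitem__)
        thresholds.append(flip_threshold(psi_i, w_alpha, gamma, v, beta, alpha))
    return min(thresholds)
```

On the counterexample of Section 6.3.2 with $\alpha = 0$, $\gamma = 0$ and labels `[1, 0, 1]`, the two vertices labeled 1 become indifferent at $\gamma_1 = 1.5$ and $\gamma_1 = 0.5$, so the conservative step sets $\gamma_1 = 0.5$ and moves only the third vertex.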

7 Applications and Experiments 

We present results of three different experiments. 

First, in Section 7.1 we compare our clique inference algorithms against applicable alternatives in the literature.
We compare the algorithms on speed and accuracy of the output assignments. For Potts potentials, we
show that $\alpha$-pass is superior to the TRW-S and min-cut based algorithms. For majority potentials, we compare
the modified $\alpha$-pass and Lagrangian relaxation based algorithms against the exact LP-based approach and the
iterated conditional modes (ICM) algorithm.







Input: $\psi$, $W$, $\alpha$, maxIters, tolerance
Output: approximately best assignment $v$
$\gamma \leftarrow 0$;
iter $\leftarrow$ 0;
$\hat{z} \leftarrow$ assignment with all vertices assigned $\alpha$;
while iter < maxIters do
    Compute $L(\gamma)$ (Equation 30), let $z$ be the solution;
    if $F(z) > F(\hat{z})$ then
        $\hat{z} \leftarrow z$;
    end
    $(v, \Delta) \leftarrow$ worst violator and violation (Equations 34 and 35);
    if $\Delta$ < tolerance then
        We are done, $L^* = L(\gamma)$;
        break;
    else if using subgradient optimization then
        Compute the new direction $d^{iter}$ and step-size $\eta^{iter}$;
        Modify $\gamma$ using Equation 36;
    else
        Modify $\gamma_v$ using golden search or conservative descent;
    end
    iter $\leftarrow$ iter + 1;
end
Construct value assignment $v$ from $\hat{z}$;
return $v$

Algorithm 2: Compute $L^*$
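A compact, runnable reading of Algorithm 2 with the subgradient branch is sketched below. Several details are assumptions made here rather than taken from the paper: harmonic step sizes $1/k$, an absolute `tolerance` instead of a fraction of an upper bound on $\gamma_v$, and an explicit feasibility check before an assignment replaces the best one found so far. `W[alpha][v]` stands for $w_{\alpha v}$.

```python
def lagrangian_relaxation(psi, W, alpha, max_iters=100, tolerance=1e-6):
    """Return (best feasible labels found, lowest Lagrangian bound seen)."""
    n, R = len(psi), len(psi[0])
    w = W[alpha]

    def score(labels):
        # F(z): objective of Equation 28 for a labeling.
        return sum(psi[i][v] + w[v] for i, v in enumerate(labels))

    gamma = [0.0] * R
    best = [alpha] * n            # the all-alpha assignment is always feasible
    bound = float("inf")
    for it in range(1, max_iters + 1):
        # Evaluate L(gamma) by a vertex-wise argmax over Equation 32.
        total = sum(gamma)
        labels, value = [], 0.0
        for row in psi:
            mod = [row[v] + w[v] - gamma[v] + (total if v == alpha else 0.0)
                   for v in range(R)]
            v_best = max(range(R), key=mod.__getitem__)
            labels.append(v_best)
            value += mod[v_best]
        bound = min(bound, value)

        counts = [0] * R
        for v in labels:
            counts[v] += 1
        diff = [counts[v] - counts[alpha] for v in range(R)]

        # Keep the assignment only if it respects the majority constraint.
        if all(d <= 0 for d in diff) and score(labels) > score(best):
            best = labels

        # Worst violation of Equation 34 or of the relaxed Equation 35.
        worst = max(max(diff[v], gamma[v] * abs(diff[v])) for v in range(R))
        if worst < tolerance:
            break

        eta = 1.0 / it            # assumed harmonic step size
        gamma = [max(0.0, gamma[v] + eta * diff[v]) for v in range(R)]
    return best, bound
```

On the three-vertex counterexample of Section 6.3.2, this sketch returns the optimal integral labeling `[1, 0, 0]` (score 11) together with the bound 11.5, and the multiplier for value 1 keeps oscillating around 0.5, which is the behavior discussed after Theorem 6.12.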



Second, in Section 7.2 we demonstrate the application of the generalized collective framework on domain adaptation
and show that using a good set of properties can bring down the test error significantly. Finally, in Section 7.3
we show that message passing on the cluster graph is a more effective way to perform inference compared to
alternatives such as ordinary belief propagation, and enjoys better convergence speeds.

7.1 Clique Inference Experiments 

In this section, we compare our algorithms against sequential tree re-weighted message passing (TRW-S) and 
graph-cut based inference for clique potentials that are decomposable over clique edges; and with ICM when 
the clique potentials are not edge decomposable. We compare them on running time and quality of the MAP. 
Our experiments were performed on both synthetic and real data. 

Synthetic Dataset: We generated cliques with 100 vertices and $R = 24$ values by choosing vertex potentials at
random from $[0, 2]$ for all values. A Potts version (POTTS) was created by gradually varying $\lambda$, and generating
25 cliques for every value of $\lambda$. We also created analogous ENTROPY, MAKESPAN and MAKESPAN2 versions of the
dataset by choosing entropy, linear makespan ($\lambda \max_v n_v$) and square makespan ($\lambda \max_v n_v^2$) clique potentials
respectively.

For MAJORITY potentials we generated two kinds of datasets (parameterized by $\lambda$): (a) MAJ-DENSE, obtained
by generating a random symmetric $W$ for each clique, where $W_{vv} = \lambda$ was the same for all $v$ and $W_{vv'} \in
[0, 2\lambda]$ ($v \neq v'$), and (b) MAJ-SPARSE, from symmetric $W$ with $W_{ij} \in [0, 2\lambda]$ for all $i, j$, roughly 70% of whose
entries were zeroed.

Of these, only POTTS is decomposable over clique edges. 
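The synthetic generators described above can be approximated in a few lines. This is a sketch under stated assumptions, not the original generator: draws are uniform, the sparse variant zeroes each symmetric pair independently with probability 0.7 (the text only says roughly 70% of entries were zeroed), and the function names are made up for illustration.

```python
import random


def vertex_potentials(n=100, R=24, rng=random):
    """n x R vertex potentials drawn uniformly from [0, 2]."""
    return [[rng.uniform(0.0, 2.0) for _ in range(R)] for _ in range(n)]


def maj_dense(R, lam, rng=random):
    """Symmetric W with W[v][v] = lam and off-diagonal entries in [0, 2*lam]."""
    W = [[lam] * R for _ in range(R)]
    for v in range(R):
        for u in range(v + 1, R):
            W[v][u] = W[u][v] = rng.uniform(0.0, 2.0 * lam)
    return W


def maj_sparse(R, lam, zero_fraction=0.7, rng=random):
    """Symmetric W with entries in [0, 2*lam]; each pair is zeroed with
    probability zero_fraction (an assumed zeroing scheme)."""
    W = [[0.0] * R for _ in range(R)]
    for v in range(R):
        for u in range(v, R):
            if rng.random() >= zero_fraction:
                W[v][u] = W[u][v] = rng.uniform(0.0, 2.0 * lam)
    return W
```

Passing a seeded `random.Random` instance as `rng` makes a generated clique reproducible, for example `maj_dense(24, 0.9, random.Random(0))`.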
CoNLL Dataset: The CoNLL 2003 dataset (http://cnts.uia.ac.be/conll2003/ner/) is a popular choice for demonstrating
the benefit of collective labeling in named entity recognition tasks. We used the BIOU encoding of the entities, which
resulted in 20 labels. We took a subset of 1460 records from the test set of CoNLL, and selected all 233 cliques of size
10 and above. The median and largest clique sizes were 16 and 259 respectively. The vertex potentials of the cliques
were set by a sequential Conditional Random Field trained on a separate training set. We created a Potts version by
setting $\lambda = 0.9/n$, where $n$ is the clique size. Such a $\lambda$ allowed us to balance the vertex and clique potentials for
each clique. A majority version was also created by learning $W$ discriminatively in the training phase.

Figure 4: Comparison with TRW-S, Graph-cut and ICM

All our algorithms were written in Java. We compared these with C++ implementations of the TRW-S
and graph-cut based expansion algorithms [5] [23] HH [4]. All experiments were performed on a Pentium-IV 3.0
GHz machine with four processors and 2 GB of RAM.



7.1.1 Edge decomposable potentials 



Figures 4(a) and 4(b) compare the performance of TRW-S vs $\alpha$-pass on the two datasets. In Figure 4(a), we
varied $\lambda$ uniformly in $[0.8, 1.2]$ with increments of 0.05. This range of $\lambda$ is of special interest, because it allows
maximal contention between the clique and vertex potentials. For $\lambda$ outside this range, the MAP is almost always
a trivial assignment, viz. one which individually assigns each vertex to its best value, or assigns all vertices to a
single value.

We compare two metrics: (a) the quality of the MAP score, captured by the ratio of the TRW-S MAP score
to the $\alpha$-pass MAP score, and (b) the runtime required to report that MAP, again as a ratio. Figure 4(a) shows
that while both approaches report nearly identical MAP scores, the TRW-S algorithm is more than 10 times
slower in more than 80% of the cases, and is never faster. This is expected because each iteration of TRW-S costs
$O(n^2)$, and multiple iterations must be undertaken. In terms of absolute run times, a single iteration of TRW-S
took an average of 193 ms across all cliques in POTTS, whereas our algorithm returned the MAP in $27.6 \pm 8.7$ ms.
Similar behavior can be observed on the CoNLL dataset in Figure 4(b). Though the degradation is not as much as
before, mainly because of the smaller average clique size, TRW-S is more than 5 times slower on more than half
the cliques.

Figure 4(c) shows the comparison with Graph-cut based expansion. The MAP ratio is even more in favor
of $\alpha$-pass, while the blowup in running time is of the same order of magnitude as TRW-S. This is surprising
because, based on the experiments in [23], we expected this method to be faster. One reason could be that their
experiments were on grid graphs whereas ours are on cliques.

Implementations used: TRW-S from http://www.adastral.ucl.ac.uk/~vladkolm/papers/TRW-S.html; graph-cut based expansion from http://vision.middlebury.edu/MRF/

Figure 5: Comparing AlphaPass, ICM, Lagrangian Relaxation and Exact on majority potentials

7.1.2 Non-decomposable potentials 

In this case, we cannot compare against the TRW-S or graph-cut based algorithms. Hence we compare with the
ICM algorithm that has been popular in such scenarios [15]. We varied $\lambda$ with increments of 0.02 in $[0.7, 1.1)$
and generated 500 cliques each from POTTS, MAJ-DENSE, MAJ-SPARSE, ENTROPY, MAKESPAN and MAKESPAN2.
We measure the ratio of the MAP score of $\alpha$-pass to that of ICM, and for each ratio $r$ we plot the fraction of cliques where
$\alpha$-pass returns a MAP that results in a ratio better than $r$. Figure 4(d) shows the results on all the potentials
except majority. The curves for linear and square makespan lie totally to the right of ratio = 1, which is expected
because $\alpha$-pass will always return the optimal answers for those potentials. For Potts too, $\alpha$-pass is better than
ICM for almost all the cases. For entropy, $\alpha$-pass was found to be significantly better than ICM in all the cases.
The runtimes of ICM and $\alpha$-pass were similar.

Majority Potentials 



In Figures 5(a) and 5(b) we compare ICM, Lagrangian Relaxation (LR) and modified $\alpha$-pass with the LP-based
exact method on synthetic data. The dotted curves plot, for each MAP ratio $r$, the fraction of cliques on which
ICM (or LR or modified $\alpha$-pass) returns a MAP score better than $r$ times the true MAP. The solid curve plots
the fraction of cliques where LR returns a MAP score better than $r$ times the ICM MAP. On MAJ-DENSE, both
modified $\alpha$-pass and ICM return a MAP score better than 0.85 of the true MAP, with ICM being slightly better.
However, LR outperforms both of them, providing a MAP ratio always better than 0.97 and returning the true
MAP in more than 70% of the cases. In MAJ-SPARSE too, LR dominates the other two algorithms, returning
the true MAP in more than 80% of the cases, with a MAP ratio always better than 0.92. The solid curve in
Figure 5(b) shows that on average, LR returns a MAP score 1.15 times that of ICM. Thus, LR performs much
better than its competitors across dense as well as sparse majority potentials.



The results on the CoNLL dataset, whose $W$ matrix is 85% sparse, are displayed in Figure 5(c). ICM, modified
$\alpha$-pass and LR return the true MAP in 87%, 95% and 99% of the cliques respectively, with the worst case MAP
ratio of LR being 0.97 as opposed to 0.94 and 0.74 for modified $\alpha$-pass and ICM respectively.

Figure 5(d) displays runtime ratios on all CoNLL cliques for all three inexact algorithms. ICM and modified
$\alpha$-pass are roughly 100-10000 times faster than the exact algorithm, while LR is roughly twice as expensive as
ICM and modified $\alpha$-pass. Thus, for practical majority potentials, LR and modified $\alpha$-pass seem to quickly
provide highly accurate solutions.

From now on, while doing the top-level message passing on the cluster graph, we shall use the $\alpha$-pass and
Lagrangian-relaxation based algorithms for computing messages from a clique, in the presence of Potts and
majority potentials respectively.



7.2 Domain Adaptation 

We now move on to the generalized collective inference framework, and show that a good set of properties can help
us in domain adaptation. We focus on the bibliographic task, where the aim is to adapt a sequential model across
widely varying publication pages of authors. Our dataset consists of 433 bibliographic entries from the webpages
of 31 authors, hand-labeled with 14 labels such as Title, Author, Venue, Location and Year. Bibliographic
entries across different authors differ in various aspects like label-ordering, missing labels, punctuation, HTML
formatting and bibliographic style.

A fraction of the 31 domains was used to train a baseline sequential model. The model was trained with the
LARank algorithm of [3], using the BCE encoding for the labels. We used standard extraction features in a
window around each token, along with label transition features [20].

For our collective framework, we use the following decomposable properties: 

$p_1(x, y)$ = First non-Other label in $y$

$p_2(x, y)$ = Token before the Title segment in $y$

$p_3(x, y)$ = First non-Other label after Title in $y$

$p_4(x, y)$ = First non-Other label after Venue in $y$



Inside a domain, any one of the above properties will predominantly favor one value, e.g. $p_3$ might favor the value
'Author' in one domain, and 'Date' in another. Thus these properties encourage consistent labeling around the
Title and Venue segments. We use a Potts potential for each property, with $\lambda = 1$.

Some of these properties, e.g. $p_3$, operate on non-adjacent labels, and thus are not Markovian. This can be
easily rectified by making 'Other' an extension of its predecessor label, e.g. an 'Other' segment after 'Title' can
be relabeled as 'After-Title'.
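The four properties and the 'Other' relabeling trick can be sketched as plain functions over a token sequence and its label sequence. This is illustrative only: labels are shown as bare segment names rather than in the BCE encoding used by the actual model, the function names are hypothetical, and `None` is an assumed convention for a property that does not fire.

```python
OTHER = "Other"


def first_non_other(labels):
    """p1: first label in y that is not Other (None if there is none)."""
    return next((lab for lab in labels if lab != OTHER), None)


def token_before_segment(tokens, labels, segment="Title"):
    """p2: token immediately preceding the first token labeled `segment`."""
    for pos, lab in enumerate(labels):
        if lab == segment:
            return tokens[pos - 1] if pos > 0 else None
    return None


def first_non_other_after(labels, segment):
    """p3 / p4: first non-Other label that follows the `segment` tokens."""
    seen = False
    for lab in labels:
        if lab == segment:
            seen = True
        elif seen and lab != OTHER:
            return lab
    return None


def extend_other(labels):
    """Relabel each Other token as After-<previous label>, so that p3 and p4
    can be read off adjacent labels."""
    out, prev = [], None
    for lab in labels:
        if lab == OTHER and prev is not None:
            out.append("After-" + prev)
        else:
            out.append(lab)
            prev = lab
    return out
```

For a made-up entry with labels `Other, Title, Title, Other, Author, Author, Other, Venue, Year`, $p_1$ is 'Title', $p_3$ is 'Author' and $p_4$ is 'Year'; after `extend_other`, the token following the title carries the label 'After-Title'.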

The performance results of the collective model with the above properties versus the baseline model are
presented in Table 3. For the test domains, we report token-F1 of the important labels: Title, Author and
Venue. The accuracies are averaged over five trials. The collective model leads to up to 25% reduction in the test
error for Venue and Title, labels for which we had defined related properties. The gain is statistically significant
($p < 0.05$). The improvement is more prominent when only a few domains are available for training. Figure 6
shows the error reduction on individual test domains for one particular split when five domains were used for
training and 26 for testing. The errors are computed from the combined token F1 scores of Title, Venue and
Author. For some domains the errors are reduced by more than 50%. Collective inference increases errors in
only two domains.

Finally, we mention that for this task, applying the classical collective inference setup with cliques over 
repeated occurrences of words leads to very minor gains. In this context, the generalized collective inference 
framework is indeed a much more accurate mechanism for joint labeling. 

7.3 Collective Labeling of Repeated Words 

We now establish that even for simple collective inference setups without any multi-clique properties, message
passing on the cluster graph (abbreviated as CI) is a better option. We consider information extraction over
text records, and define cliques over multiple occurrences of words. We create two versions of the experiment,
with Potts and majority potentials on the cliques respectively.

Table 3: Token-F1 of the Collective and Base Models on the Domain Adaptation Task

Figure 6: Per-domain error for the base and collective inference (CI) model

Since the Potts potential is decomposable over the clique edges, we compare CI against the TRW-S algorithm 
of [13] which is the state of the art algorithm for belief propagation. We compare the majority potential version 
against the stacking approach of [15] . 

We report results on three datasets: the Address dataset consisting of roughly 400 non-US postal addresses,
the Cora dataset [18] containing 500 bibliographic records, and the CoNLL'03 dataset. The training splits were
30%, 10% and 100% respectively for the three datasets, and the parameter $\lambda$ for Potts was set to 0.2, 1 and 0.05.
The majority parameter $W$ was learnt generatively through label co-occurrence statistics in cliques seen in the
training data.

Table 4 reports the combined token-F1 over all labels except 'Other'. Unless specified, all the approaches
post statistically significant gains over the base model. For majority potentials, CI is superior to the stacking
based approach. For the Potts version, there is no clear winner, as TRW-S achieves F1 slightly better than or close to
that of CI. But collective inference with majority potentials is more accurate than with Potts.

Exploring Potts potentials further, Figure 7 plots the accuracy of the two methods versus
the number of iterations. CI achieves its best accuracy after just one round, whereas TRW-S takes around 20
iterations. In terms of clock time, an iteration of TRW-S cost about 3.2s for Cora, and that of CI cost 3s. TRW-S
therefore needs roughly $20 \times 3.2 = 64$s against 3s for CI, so CI is roughly an order of magnitude faster than
TRW-S for the same accuracy levels. The comparison was similar for the Address dataset.

Hence the CI approach is applicable for all symmetric potentials, and exploits their form to get higher
accuracies faster than other methods.

                      81.5
                      81.9    89.7    88.8
                      81.9    89.7
MAJORITY   CI         82.2    89.6    88.8
MAJORITY   Stacking   81.7*   87.5†   87.8

Table 4: Token F1 scores of various approaches on collective labeling with repeated words. Results averaged over
five trials for Address and Cora. A '*' denotes statistically insignificant difference (p > 0.05), † means statistically
significant loss.

Figure 7: Accuracy with iterations for CI vs TRW-S on Cora and Address.



8 Conclusions and Future Work 

We proposed a generalized collective inference framework based on decomposable properties and symmetric 
potential functions to maintain conformity in the labeling of multiple MRFs. We perform joint MAP inference 
using a cluster graph that defines special separator variables based on property values. The messages inside 
MRF clusters were modified to make them property-aware. Special combinatorial algorithms were used at the 
property cliques to compute outgoing messages. 

We demonstrated the effectiveness of the framework by applying it on a domain adaptation task with a rich
set of properties. We also established that message passing on the cluster graph is an effective solution vis-à-vis
cluster-oblivious approaches based on ordinary belief propagation.

Algorithmically, we presented potential-specific combinatorial algorithms for inference in a clique. We gave a 
Lagrangian relaxation method for generating messages from a clique with majority potential. This algorithm is 
two orders of magnitude faster than an exact algorithm and more accurate than other approximate approaches. 
We also presented the $\alpha$-pass algorithm for Potts potentials, which enjoys a tight approximation guarantee
and is sub-quadratic in the clique size. We showed that $\alpha$-pass is faster and more accurate than
alternatives such as TRW-S and graph-cuts.

Future directions 

We wish to automate the selection of important decomposable associative properties. Another issue is the
domain-adaptive training of the property parameters (e.g. $\lambda$ for Potts). Joint training of these parameters with the
baseline model would require expensive calls to the collective inference algorithm at each step, so a cheaper
alternative has to be investigated.

Next, our property clusters are presently defined as cliques with symmetric potentials, which have limited
expressive power. So we are interested in looking at dense weighted subgraphs instead of cliques, thus modeling
that not all vertex-pairs have equal associativity.

Finally, we wish to seek more applications for collective inference, and deploy collective inference on a large
scale. Although our cluster message passing based solution is distributed and inherently parallelizable, the clique
participants might lie on different physical machines. This, and some other interesting scaling issues, will crop
up as we try to run collective inference on a web scale.